SMARTS - Accelerating Microarchitecture Simulation Via Rigorous Statistical Sampling
Figure 3. Minimum instructions required. This graph shows the minimum number of instructions
which must be measured to achieve commonly used confidence intervals.
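The sample sizes behind Figure 3 can be reproduced from standard sampling theory. The sketch below is a minimal illustration, not the paper's code: it assumes the textbook relation n = (z·V/ε)², where V is the coefficient of variation of per-unit CPI (the value used here is made up), z is the normal critical value, and ε is the relative error target.

```python
import math

def min_sample_size(v_cpi, rel_error, z):
    # Smallest n with z * v_cpi / sqrt(n) <= rel_error, i.e. the number of
    # sampling units needed to meet the requested confidence interval.
    return math.ceil((z * v_cpi / rel_error) ** 2)

U = 10           # sampling unit size used for Figure 3
v_cpi = 1.5      # assumed coefficient of variation of per-unit CPI
n = min_sample_size(v_cpi, 0.03, 3.0)   # +/-3% error at 99.7% confidence
print(n, n * U)  # units needed, and total instructions measured in detail
```

Even with this deliberately pessimistic V, the instructions measured in detail remain a vanishing fraction of a multi-billion-instruction benchmark.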
…since the coefficients of variation of many benchmarks are non-negligible even for sampling units of over one billion instructions.

For U = 10, Figure 3 reports the values of n·U for all benchmarks, assuming several commonly used confidence targets. Even for a stringent confidence requirement of ±1% error with 99.7% confidence, the worst-case benchmark on the 8-way configuration in our study requires no more than 0.1% of its instruction stream to be measured. The number of instructions required to achieve a particular level of confidence does not vary significantly across benchmarks because, for the most part, the benchmarks have similar values of V_CPI. The exceedingly low detailed simulation requirement suggests that the simulation rate of SMARTS is insensitive to the speed of the detailed microarchitecture simulation. Rather, the rate depends on the speed of the functional simulation performed for the great majority of the instruction stream between sampling units. This optimistic assessment of speedup opportunity does not factor in the detailed simulation cost for microarchitectural state warming. We next present an analytical performance model for SMARTS to take into account the cost of detailed and functional warming.

…tion rate of future processor cores). The right-hand-side vertical axis estimates the corresponding runtimes on a 2 GHz Pentium 4. The plot shows that SMARTS simulation speed decreases from S_F to S_D as W is increased; furthermore, the anticipated future S_D results in an earlier and sharper decrease. Therefore, unless W can be bounded to a reasonably small value, full benchmark measurement by simulation sampling would remain prohibitively slow.

The simulation rate of SMARTS with functional warming can be derived from the expression for detailed warming by substituting S_FW (the functional warming simulation rate) for S_F. Functional warming allows us to bound W to less than a few thousand instructions, sufficiently few that detailed warming does not affect the simulation rate. This implies that the simulation rate of SMARTS with functional warming stays close to the simulation rate of S_FW and is relatively insensitive to the performance of the detailed simulator. In other words, the SMARTS framework enables researchers to apply otherwise prohibitively slow detailed simulators to study complete benchmarks, provided efficient functional warming is possible. In the next section, we present our implementation of SMARTS, where S_FW ≈ 0.55.
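The claim that overall speed tracks S_FW can be sanity-checked with a back-of-the-envelope runtime model. This is our own sketch, not the paper's analytical expression; the rates and counts below (s_fw, s_d, N, n) are illustrative assumptions.

```python
def smarts_runtime_hours(N, n, U, W, s_fw, s_d):
    # n sampled units of U instructions, each preceded by W instructions of
    # detailed warming, run at the detailed rate s_d; the remaining
    # instructions run under functional warming at rate s_fw (insns/sec).
    detailed = n * (W + U)
    functional = N - detailed
    return (functional / s_fw + detailed / s_d) / 3600

# Assumed rates: functional warming at 10M insns/sec, detailed at 0.3M.
t = smarts_runtime_hours(100e9, 10_000, 1000, 2000, 10e6, 0.3e6)
print(round(t, 1))  # total hours; dominated by the functional-warming term
```

With these numbers the detailed portion contributes only seconds, so the total runtime is essentially N/S_FW, which is the insensitivity argument made above.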
Figure 5. Optimal U. The left chart shows that the optimal U increases with W.
The right chart shows that U = 1000 is a reasonable choice across benchmarks and W.
4. SMARTS in practice

To study and demonstrate the effectiveness of the SMARTS framework, we developed SMARTSim, a concrete implementation of a sampling microarchitecture simulator. In this section, we describe the implementation of SMARTSim and revisit the issues of microarchitectural state generation in greater detail. In particular, we explain the effect of detailed warming on the choice of sampling unit size and analyze the effectiveness of detailed warming and functional warming in generating accurate microarchitectural state for sample measurements.

4.1. SMARTSim

SMARTSim is built on our enhanced sim-outorder as described in Section 3.2. Sim-outorder supports a functional simulation mode, similar to the operation of sim-fast in SimpleScalar, that runs approximately 60 times faster than detailed simulation. However, sim-outorder only supports functional simulation prior to starting detailed simulation. SMARTSim allows repeated transitions back and forth between functional and detailed simulation modes.

SMARTSim accepts sim-outorder command-line arguments and configuration files. In addition, SMARTSim accepts the systematic sampling parameters U, k, W, and j (described in Section 3.1). SMARTSim also supports two fast-forwarding options: functional simulation only, and functional simulation with warming (a.k.a. functional warming). For functional warming, SMARTSim performs in-order functional instruction execution and maintains the state of the L1/L2 I/D caches, TLBs, and branch predictors in a fashion similar to sim-cache and sim-bpred of SimpleScalar. In SMARTSim, functional warming operations introduce an overhead of approximately 75% over functional simulation alone.

4.2. Optimal sampling unit size

SMARTSim allows the user to specify the sampling unit size U. In the analysis in Section 3.3, we showed that smaller unit sizes reduce the number of instructions simulated in detail if the cost of detailed warming is ignored. However, because detailed warming adds an overhead of W instructions of detailed simulation per sampling unit, the optimal value of U increases with W to amortize the overhead of detailed warming. To illustrate the effect of W on the choice of U, Figure 5 (left) plots the fraction of instructions simulated in detail (i.e., n(W + U)/N) for various values of U and W. The data points are based on SMARTSim execution of gcc-1 on the 8-way configuration, with n chosen for a 99.7% confidence interval of ±3% in the CPI estimate. In the idealized case where W = 0, the minimum U leads to the fewest detail-simulated instructions. For non-ideal W, however, the optimal value of U lies in the range of 100 to 10,000 instructions. Figure 5 (right) locates the optimal values of U for three other benchmarks: gcc-3, bzip2-1, and mesa. Each benchmark is plotted for two values of W (1,000 and 100,000) that are approximately the magnitudes needed for sampling with and without functional warming, as discussed in the following two sections. The optimal choice of U is not fixed across benchmarks. However, in all cases, including other SPEC2K benchmarks not shown, fixing U at 1000 leads to a sufficiently small fraction of detail-simulated instructions that choosing the optimal U gains at most tens of minutes of SMARTSim run time. Therefore, we suggest using U = 1000 in all cases.

4.3. Effectiveness of detailed warming

Microarchitectural state can always be warmed to an arbitrary degree of accuracy given sufficient detailed warming. Unfortunately, the amount of detailed warming required to obtain a given degree of accuracy cannot be determined analytically. The required amount is a function
Table 5. CPI bias achieved with functional warming and minimal detailed warming.

8-way, W = 2000:  vpr -1.6% | galgel 1.4% | gcc-2 -1.1% | bzip2-2 -1.0% | parser 1.0% | gzip-5 0.9% | facerec 0.9% | gcc-5 -0.8% | vortex-3 -0.6% | gcc-1 -0.5% | avg. rest (abs) 0.2%
16-way, W = 4000: mcf 1.9% | gcc-2 -1.6% | vortex-3 1.2% | eon-2 -1.1% | gcc-5 -1.1% | sixtrack -0.9% | wupwise 0.9% | bzip2-1 0.8% | applu 0.7% | mesa -0.6% | avg. rest (abs) 0.2%
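The detail-simulated fraction n(W + U)/N from Section 4.2 is straightforward to evaluate directly. In the sketch below, U = 1000, W = 2000, and n = 10,000 echo values used in the text, while the benchmark length N is an assumed placeholder.

```python
def detailed_fraction(n, U, W, N):
    # Fraction of an N-instruction benchmark simulated in detail:
    # n sampling units, each with W instructions of detailed warming
    # followed by U measured instructions (Section 4.2: n*(W + U)/N).
    return n * (W + U) / N

# Assumed benchmark length of 100 billion instructions.
f = detailed_fraction(10_000, 1000, 2000, 100e9)
print(f"{f:.4%}")  # a fraction well under 0.1%
```

Even with detailed warming included, the detail-simulated portion stays far below the 0.1% bound quoted in Section 3 for warming-free sampling.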
Figure 6. SMARTS results across SPEC2K with n = 10,000. Unacceptably large confidence intervals (e.g., 8-way ammp, vpr, and gcc-2) can be improved by simulating with n_tuned.
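The predicted intervals in Figure 6, and the n_tuned rerun sizes discussed in Section 5.2, both follow from the measured coefficient of variation. The sketch below assumes the standard relation (relative interval = z·V̂/√n, with z = 3 for 99.7% confidence); the V̂ values are invented for illustration.

```python
import math

def rel_confidence_interval(v_hat, n, z=3.0):
    # +/- relative half-width of the CPI confidence interval for
    # n sampling units at ~99.7% confidence (z = 3).
    return z * v_hat / math.sqrt(n)

def tuned_sample_size(v_hat, target=0.03, z=3.0):
    # Smallest n whose predicted interval fits within the +/-3% target,
    # i.e. the second-stage sample size computed from a pilot run's V-hat.
    return math.ceil((z * v_hat / target) ** 2)

print(rel_confidence_interval(1.0, 10_000))  # 0.03, i.e. within +/-3%
print(tuned_sample_size(2.5))                # a high-variance benchmark needs more units
```

Inverting the interval formula this way is what makes the two-stage procedure constructive: the pilot sample's V̂ determines exactly how many more units are required.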
…from the estimate, without affecting confidence. If the bias can only be bounded, then it introduces a proportional amount of uncertainty in the estimate beyond the confidence interval.

5.2. Evaluation of performance and accuracy

We applied the procedure outlined above to the SPEC2K benchmarks using SMARTSim. Figure 6 reports results of CPI estimated using SMARTSim in one run with n_init = 10,000. Benchmarks with the worst confidence intervals are shown in sorted order, plus the average of the remaining benchmarks. For each benchmark, we show the actual achieved error and the predicted confidence interval calculated from V̂_x for 99.7% confidence. The confidence interval accounts for the random error in the estimated CPI that is introduced by systematic sampling. Notice that the actual error resulting from 10,000 sampling units is generally much less than the predicted confidence interval. A large part of this error can be attributed to the residual bias of imperfect microarchitectural state warming (functional warming with fixed W), with only a very small component caused by statistical sampling.

For most of the benchmarks, n_init achieves a confidence interval within ±3%. For benchmarks with confidence intervals greater than ±3%, simulation sampling needs to be repeated using n_tuned, calculated from the V̂_x of the initial sample. For example, rerunning simulations for the 8-way configuration with n_tuned of 66,531 (ammp), 23,321 (vpr), and 21,789 (gcc-2) achieves actual errors of 1.1%, 0.1%, and -0.9% with confidence intervals of 3.0%, 2.9%, and 2.6%. To this confidence interval, we must still add an uncertainty due to microarchitectural state warming bias, which we empirically bound to below 2%.

Figure 7. SMARTS EPI results with n = 10,000.

…confidence intervals are overshadowed by the microarchitectural state warming bias. With the exception of gap, the actual errors are within the confidence interval. For gap, we have determined experimentally that the 2.2% error is almost entirely due to bias.

Table 6 compares simulation runtimes for functional (i.e., sim-fast), detailed (i.e., sim-outorder with detailed memory models), and SMARTSim simulation on a 2 GHz Pentium 4. The SPEC2K benchmarks on the 8-way configuration with the highest instruction counts are shown in sorted order. As shown in Table 6, detailed simulation takes on average 7.2 days and can take as long as 23 days. In contrast, SMARTSim takes on average 5.0 hours and in the worst case slightly less than 16 hours. SMARTSim simulation speed is around 50% of functional-only simulation for most microarchitecture configurations.

5.3. Comparison to SimPoint

A recent proposal, SimPoint [17], also enables reduced simulation turnaround time. SimPoint selects representative subsets of benchmark traces via offline analysis of basic blocks. Using clustering algorithms, SimPoint selects and weights several large sampling units (up to ten 100M-instruction sampling units) such that the frequency of each static basic block across the weighted units matches that block's frequency in the full dynamic stream. A fundamental assumption of SimPoint is that all dynamic instances of basic block sequences with similar profiles have the same behavior; therefore, a particular sequence can be measured once and weighted appropriately to represent all remaining instances.

SimPoint has two key advantages: (1) due to its large sampling units, SimPoint obviates the need for functional warming and can be more quickly integrated into a simulation infrastructure, and (2) SimPoint allows early termination of simulation after all selected sections have been visited. We implemented SimPoint with our SimpleScalar toolset and verified our implementation against the published configuration and results in [17].
Table 6. Runtimes for SMARTS compared to detailed and functional simulation. (8-way)
Runtime (hrs.) parser sixtrack mgrid galgel wupwise apsi twolf ammp mesa gap fma3d swim avg. rest
Detailed 541 466 414 405 346 344 343 323 279 266 265 223 98
Functional 9.2 7.9 7.0 6.9 5.9 5.8 5.8 5.5 4.7 4.5 4.5 3.8 1.7
SMARTS 15.8 13.6 12.1 11.8 10.1 10.1 10.0 9.6 8.1 7.8 7.8 6.5 2.9
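As a quick cross-check of Table 6, the average runtimes quoted in Section 5.2 (7.2 days for detailed simulation versus 5.0 hours for SMARTSim) reproduce the roughly 35x speedup reported in the conclusion:

```python
# Averages quoted in the text for the 8-way configuration.
detailed_hours = 7.2 * 24   # 7.2 days of detailed simulation, in hours
smarts_hours = 5.0          # average SMARTSim runtime

speedup = detailed_hours / smarts_hours
print(round(speedup))  # ~35x over full-stream detailed simulation
```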
Figure 8. Comparison of SMARTS with SimPoint. SimPoint's mean runtime per benchmark is 2.8 hours, compared to 5.0 hours for SMARTS.

For the benchmarks in [17] and our 8-way configuration, SimPoint resulted in an average improvement of 1.8x in simulation rate over SMARTS.

However, SimPoint has several shortcomings: (1) it may result in arbitrarily high CPI error, (2) it does not offer quantifiable confidence in its estimates, and (3) some microarchitecture configurations may cause large variations in behavior across different instances of similarly-profiled basic block sequences.¹

Figure 8 presents a comparison of CPI error between SimPoint and SMARTS for the benchmarks presented in [17] running on our 8-way configuration. The comparison shows that SimPoint has a higher average error (3.7% vs. our 0.6%) and a considerably higher worst-case error (-14.3% for gcc-2).

Gcc-2 is an example where SimPoint produces an unacceptably high CPI error when running on our 8-way configuration. However, simulation using the published microarchitecture configuration in [17] results in only a 1.6% error. In gcc-2, we observed that the basic block sequences chosen by SimPoint exhibit large variations in their L2 miss rate, due to variations in data cache locality, across dynamic instances on our microarchitecture configuration. Therefore, in this case, the SimPoint estimate based on just a single instance of the basic block sequences yields a large error. In contrast, independent of benchmark and microarchitecture configuration, SMARTS uses the measured coefficient of variation to help gauge both the required sample size and the confidence in the estimates.

1. Consider a basic block comprised of a pointer-chasing loop. The execution time of each dynamic instance depends on whether the pointer dereference hits in the cache and hence is a function of the cache design and the precise memory placement of the linked-list nodes.

…and performance, we propose the Sampling Microarchitecture Simulation (SMARTS) framework, which applies statistical sampling to microarchitecture simulation. Unlike prior approaches to simulation sampling, SMARTS prescribes an exact and constructive procedure for sampling a minimal subset of a benchmark's instruction execution stream to estimate the performance of the complete benchmark with quantifiable confidence. The SMARTS procedure obviates the need for full-stream simulation by basing the strategy for optimal simulation sampling on the outcomes of fast sampling simulation runs.

We evaluated the SMARTS framework in the context of a wide-issue out-of-order superscalar simulator running SPEC2K benchmarks with varying inputs under two simulated processor configurations. SMARTSim, an implementation of SMARTS, was created by modifying SimpleScalar's sim-outorder to support systematic sampling. The results of our evaluations demonstrated the following: (1) SMARTSim achieves an actual average error of only 0.64% on CPI and 0.59% on EPI by simulating fewer than 50 million instructions in detail per benchmark. (2) By simulating exceedingly small fractions of complete benchmarks, SMARTSim achieves effective speeds of 9.2 MIPS and 9.0 MIPS simulating 8-way and 16-way out-of-order processors on a 2 GHz Pentium 4. This corresponds to speedups of 35 and 60 times over full-stream simulation with sim-outorder for the two configurations.

The outcomes of this study have two fundamental bearings on future simulator designs. First, designers should not attempt to accelerate detailed simulators at the cost of coding complexity or abstraction errors; instead, designers should focus on increasing the simulator's flexibility and realism. Second, designers should focus on techniques to speed up fast-forwarding and functional warming, because these ultimately determine sampling simulation time.

Acknowledgment

The authors would like to thank Se-Hyun Yang, Zeba Wunderlich, the members of the Carnegie Mellon Impetus group, and the anonymous reviewers for their feedback on earlier drafts of this paper. This work was funded in part by grants from IBM and Intel corporations, an NSF CAREER award, and an NSF Instrumentation award.