needed, aggressive optimisations can be performed, and the performance is usually several orders of magnitude faster than timing simulators. QEMU falls into this category. It should be noted, though, that while QEMU itself is a purely functional simulator, it can be modified to collect metadata for off-line or on-line cache simulation [18].

The other category is timing simulators. Among timing simulators, RTL simulators can model processor microarchitectures very precisely, but the difficulty of implementing a feature in an RTL simulator is not much different from implementing it in hardware directly. RTL simulators also perform poorly, usually running on the order of kIPS [16].

At a higher level, there are cycle-level microarchitectural simulators. These are able to omit RTL implementation details to improve performance while retaining a detailed microarchitectural model. A popular example is the gem5 simulator running in In-Order or O3 mode [4]. For faster performance, we can give up some further microarchitectural detail and predict the number of cycles taken by each non-memory instruction instead of computing them in real time; in the extreme case, we can assume every non-memory operation takes only 1 cycle to execute, as gem5's "timing simple" CPU model assumes. This approach is no longer cycle-accurate, but such a cycle-approximate model is often adequate for cache and memory simulations.

2.2 Binary Translation
Binary translation is a technique that accelerates instruction set architecture (ISA) simulation or program instrumentation [11]. An interpreter will fetch, decode and execute the instruction pointed to by the current program counter (PC) one by one, while binary translation will, either ahead of time (static binary translation) or at runtime, i.e. when a block of code is first executed (dynamic binary translation, DBT), translate one or more basic blocks from the simulated ISA to the host's native code, cache the result, and reuse the translated code the next time the same block is executed. QEMU uses binary translation for cross-ISA simulation or when there is no hardware virtualisation support [2]. Böhm et al. proposed a method to introduce binary translation to single-core timing simulation in 2010 [5]. Parallel DBT has also been applied to scalable multi-core simulation [1], but it provides only a functional simulation mode and supports no timing or modelling of the memory system.

An accurate model of cache coherence and the memory hierarchy requires that multiple cores are simulated in lockstep (or in a way that guarantees equivalent results). Simulators that forego this are unable to properly simulate race conditions and shared resources. Existing cycle-level simulators such as gem5 achieve lockstep by iterating through all simulated cores each cycle, which causes a significant performance drop. Spike (or riscv-isa-sim), on the other hand, switches the active core less frequently: its default compilation option only switches cores every 1000 cycles, making it impossible to model race conditions where all cores are trying to acquire a lock simultaneously. No existing binary-translated simulator can model multi-core interaction in lockstep, and therefore none of them can model cache coherency or a shared second-level cache properly.

3 IMPLEMENTATION

3.1 Overview
The high-level control flow of R2VM, as shown in Figure 1, is similar to other binary translators. When an instruction at a particular PC is to be executed, the code cache is looked up and the cached translated binary is executed directly if found; otherwise, the binary translator is invoked and an entire basic block is fetched, decoded, and translated. We have used a variety of techniques that are often found in other binary translators to improve performance, such as block chaining [2].

As full-system simulation is supported, we have to deal with the case that a 4-byte uncompressed instruction spans two pages. We handle this by creating a stub that reads the 2 bytes that lie on the second page each time the stub is executed, and patches the generated code if the 2 bytes read differ from those seen at initial translation.

Cota et al. [9] suggest sharing a code cache between multiple cores to promote code reuse and boost performance. In contrast, we give each hardware thread its own code cache. This allows different code to be generated for each core, e.g. in the case of heterogeneous cores, and it also lessens the synchronisation requirements when modifying the code cache, simplifying the implementation.
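To make the dispatch path of Section 3.1 concrete, the sketch below shows a minimal per-core code cache and dispatch loop in Rust. It only illustrates the control flow described above and is not R2VM's actual code: the names (Core, CodeCache, translate_block) are invented for the example, and a translated block is modelled as a host closure rather than generated machine code.

    use std::collections::HashMap;

    // Minimal per-core state for the sketch.
    struct Core {
        pc: u64,
        instret: u64, // retired instruction counter
    }

    // A "translated block" is modelled as a host closure that runs the block and
    // returns the next PC. A real DBT would emit native machine code instead.
    type TranslatedBlock = Box<dyn Fn(&mut Core) -> u64>;

    // Each hardware thread owns its own private code cache, keyed by guest PC.
    struct CodeCache {
        blocks: HashMap<u64, TranslatedBlock>,
    }

    impl CodeCache {
        fn new() -> Self {
            CodeCache { blocks: HashMap::new() }
        }

        // Return the cached translation for `pc`, invoking the translator on a miss.
        fn get_or_translate(&mut self, pc: u64) -> &TranslatedBlock {
            self.blocks.entry(pc).or_insert_with(|| translate_block(pc))
        }
    }

    // Stand-in for the binary translator: fetch, decode and translate one basic block.
    fn translate_block(pc: u64) -> TranslatedBlock {
        // A real implementation decodes RISC-V instructions up to the end of the
        // basic block and generates host code; here the "block" just retires one
        // 4-byte instruction and falls through.
        Box::new(move |core: &mut Core| {
            core.instret += 1;
            pc + 4
        })
    }

    // Dispatch loop: run cached code when present, translate first otherwise.
    fn run(core: &mut Core, cache: &mut CodeCache, blocks_to_run: usize) {
        for _ in 0..blocks_to_run {
            let block = cache.get_or_translate(core.pc);
            core.pc = block(core);
        }
    }

    fn main() {
        let mut core = Core { pc: 0x8000_0000, instret: 0 };
        let mut cache = CodeCache::new(); // one cache per core
        run(&mut core, &mut cache, 4);
        println!("pc = {:#x}, instret = {}", core.pc, core.instret);
    }

In R2VM the cached entries are generated machine code with block chaining rather than closures, but the lookup-or-translate structure is the same, and because each core has its own code cache nothing in this path needs synchronisation with other cores.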
Figure 1: High-level control flow of R2VM. A code cache hit leads directly to execution of the translated code; a miss invokes the binary translator, which begins code generation, calls the pipeline model's block-begin hook, and then repeatedly fetches and decodes an instruction, calls the before-instruction hook, translates the instruction, and calls the after-instruction hook until the end of the block.
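As a rough illustration of how the pipeline-model hooks in Figure 1 drive code generation, the sketch below wires a hook trait into a per-block translation loop. The trait and method names (PipelineModel, block_begin, before_instruction, after_instruction) mirror the hook names in Figure 1 but are assumptions for this example, as are the stub DbtCompiler and Op types.

    // Stand-ins for the translator's compiler object and decoded instructions.
    struct DbtCompiler {
        emitted_cycles: u64,
    }

    impl DbtCompiler {
        // Emit code that charges `n` cycles to the simulated cycle counter (stubbed here).
        fn insert_cycle_count(&mut self, n: u64) {
            self.emitted_cycles += n;
        }
        // Emit host code for one guest instruction (stubbed here).
        fn translate(&mut self, _op: &Op) {}
    }

    enum Op {
        Add,
        Branch,
    }

    // Hooks a pipeline model can implement; all default to doing nothing,
    // which corresponds to a purely functional model.
    trait PipelineModel {
        fn block_begin(&mut self, _c: &mut DbtCompiler) {}
        fn before_instruction(&mut self, _c: &mut DbtCompiler, _op: &Op, _compressed: bool) {}
        fn after_instruction(&mut self, _c: &mut DbtCompiler, _op: &Op, _compressed: bool) {}
    }

    // Translate one basic block, invoking the model's hooks around each instruction.
    fn compile_block(model: &mut dyn PipelineModel, c: &mut DbtCompiler, block: &[(Op, bool)]) {
        model.block_begin(c);
        for (op, compressed) in block {
            model.before_instruction(c, op, *compressed);
            c.translate(op);
            model.after_instruction(c, op, *compressed);
        }
    }

    // A timing-simple-style model: charge one cycle after every instruction.
    struct TimingSimple;
    impl PipelineModel for TimingSimple {
        fn after_instruction(&mut self, c: &mut DbtCompiler, _op: &Op, _compressed: bool) {
            c.insert_cycle_count(1);
        }
    }

    fn main() {
        let mut c = DbtCompiler { emitted_cycles: 0 };
        let block = [(Op::Add, false), (Op::Branch, true)];
        compile_block(&mut TimingSimple, &mut c, &block);
        println!("cycles charged for this block: {}", c.emitted_cycles);
    }

A model only pays for the detail it needs: the hooks of a simple model emit a constant cycle charge, while a detailed model can emit arbitrary bookkeeping code into the translated block.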
        compiler.insert_cycle_count(1);
    }

    fn after_taken_branch(&mut self, compiler: &mut DbtCompiler,
                          _op: &Op, _compressed: bool) {
        compiler.insert_cycle_count(1);
    }
}

Listing 1: Timing simple model implementation

Fibers are cooperatively scheduled in user space, in contrast to traditional threads, which are preemptively scheduled by the operating system and are generally heavy-weight constructs. They are often used in I/O-heavy, highly concurrent workloads such as network programming, but here we borrow them for our simulator. In our implementation, we create one fiber for each hardware thread simulated, plus a fiber for the event loop. Each time the pipeline model instructs the DBT to wait for a number of cycles, we generate the corresponding number of yields. Listing 2 shows an example of generated code under the timing simple model.
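R2VM implements this with stackful fibers and generated machine code; the standard library has no stable fiber primitive, so the sketch below uses async tasks polled by a hand-written round-robin driver as a stand-in, where awaiting wait_cycles(1) plays the role of a single-cycle yield. All names are invented for the example, and the event-loop fiber is omitted.

    use std::future::Future;
    use std::pin::Pin;
    use std::ptr;
    use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

    // A future that suspends exactly once: awaiting it models "yield for one cycle".
    struct YieldCycle {
        yielded: bool,
    }

    impl Future for YieldCycle {
        type Output = ();
        fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
            if self.yielded {
                Poll::Ready(())
            } else {
                self.yielded = true;
                Poll::Pending
            }
        }
    }

    // Wait for `n` simulated cycles by yielding `n` times.
    async fn wait_cycles(n: u64) {
        for _ in 0..n {
            YieldCycle { yielded: false }.await;
        }
    }

    // One task per simulated hart: run a few "instructions", one cycle each.
    fn hart(id: usize, instructions: u64) -> Pin<Box<dyn Future<Output = ()>>> {
        Box::pin(async move {
            for i in 0..instructions {
                // ... translated code for one instruction would execute here ...
                println!("hart {id}: instruction {i}");
                wait_cycles(1).await; // charge one cycle and yield to the scheduler
            }
        })
    }

    // No-op waker: the scheduler below polls every task each cycle regardless.
    unsafe fn vt_clone(_: *const ()) -> RawWaker {
        RawWaker::new(ptr::null(), &NOOP_VTABLE)
    }
    unsafe fn vt_noop(_: *const ()) {}
    static NOOP_VTABLE: RawWakerVTable = RawWakerVTable::new(vt_clone, vt_noop, vt_noop, vt_noop);

    fn main() {
        let waker = unsafe { Waker::from_raw(RawWaker::new(ptr::null(), &NOOP_VTABLE)) };
        let mut cx = Context::from_waker(&waker);

        // One task per hart; a real implementation would add one for the event loop.
        let mut harts: Vec<_> = (0..2).map(|id| hart(id, 3)).collect();

        let mut cycle = 0u64;
        while !harts.is_empty() {
            // Lockstep: every hart advances by exactly one cycle before the next round.
            harts.retain_mut(|h| h.as_mut().poll(&mut cx).is_pending());
            cycle += 1;
        }
        println!("finished after {cycle} cycles");
    }

The round-robin order is what provides lockstep: no hart can advance a second cycle until every other hart has advanced its first.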
Figure: Memory access flow with the L0 data cache. The permission check either fails or passes; on an L0 hit the address is obtained with an XOR, while on a miss the simulated cache model is invoked and the entry is inserted into the L0 data cache.
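The flow in the figure can be sketched as a small direct-mapped L0 data cache, assuming each entry stores the virtual line tag together with the XOR of the virtual and physical line numbers so that a hit recovers the physical address with a single XOR. The permission check and the detailed cache model are reduced to stubs, the line size is fixed rather than runtime-configurable, and all names are illustrative rather than R2VM's.

    const LINE_SIZE: u64 = 64; // fixed here; runtime-configurable in R2VM

    // One L0 entry: the virtual line tag plus (virtual ^ physical) line number,
    // so translation on a hit is a single XOR.
    #[derive(Clone, Copy)]
    struct L0Entry {
        vtag: u64,
        xor: u64,
    }

    struct L0Cache {
        entries: Vec<Option<L0Entry>>,
    }

    impl L0Cache {
        fn new(sets: usize) -> Self {
            L0Cache { entries: vec![None; sets] }
        }

        fn index(&self, vline: u64) -> usize {
            (vline as usize) % self.entries.len()
        }

        // Translate a virtual address to a physical address through the L0 cache,
        // falling back to the simulated cache model on a miss.
        fn access(&mut self, vaddr: u64, write: bool) -> Result<u64, ()> {
            check_permission(vaddr, write)?; // failure takes the fault path (stubbed)

            let vline = vaddr / LINE_SIZE;
            let idx = self.index(vline);
            if let Some(e) = self.entries[idx] {
                if e.vtag == vline {
                    // Hit: obtain the physical address with a single XOR.
                    return Ok((vline ^ e.xor) * LINE_SIZE + vaddr % LINE_SIZE);
                }
            }

            // Miss: invoke the simulated cache model, then insert the entry.
            let pline = simulated_cache_model(vline, write);
            self.entries[idx] = Some(L0Entry { vtag: vline, xor: vline ^ pline });
            Ok(pline * LINE_SIZE + vaddr % LINE_SIZE)
        }
    }

    // Stubs standing in for the MMU permission check and the detailed cache model.
    fn check_permission(_vaddr: u64, _write: bool) -> Result<(), ()> {
        Ok(())
    }
    fn simulated_cache_model(vline: u64, _write: bool) -> u64 {
        // A real model walks the page table, updates cache state and accounts cycles;
        // here we simply pretend that physical equals virtual.
        vline
    }

    fn main() {
        let mut l0 = L0Cache::new(256);
        let pa = l0.access(0x8000_1234, false).unwrap();
        println!("paddr = {:#x}", pa);
    }

Setting LINE_SIZE to the page size turns the same structure into an L0 data TLB, which matches the runtime-configurable line size described below.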
R2VM supports pipeline model switching simply by flushing the code cache of translated binaries and letting the DBT engine use the new model's hooks for code generation. Moreover, since, as mentioned in Section 3.1, each core has its own code cache for DBT-generated code, pipeline models can be specified per core rather than globally.

The memory model is switched at runtime by flushing the L0 data cache and the instruction cache. The cache line size is also a runtime-configurable property. For example, if both the TLB and the cache are simulated, the cache line size can be set to 64 bytes. If only the TLB is simulated, the cache line size can be set to 4096 bytes, effectively turning the L0 data cache into an L0 data TLB.

If the memory model permits, R2VM can also switch between lockstep execution and parallel execution at runtime, like other binary translators. Parallel execution is enabled with the "atomic" memory model. When paired with the "atomic" pipeline model, this is functionally equivalent to QEMU and gem5's atomic model, which permits fast-forwarding of the aforementioned booting and preparation steps.

4 EVALUATION
As described in Section 3, R2VM offers a range of pipeline models and memory models to select from, and allows switching between them mid-simulation. Each model shows different trade-offs. The list of pre-implemented pipeline and memory models can be found in Table 1 and Table 2.

For the non-coherent memory models, we used a single-core micro-benchmark similar to the MemLat tool from the 7-zip LZMA benchmark [14]. For the MESI cache-coherency model, we used a micro-benchmark that simulates a scenario where two cores are heavily contending for a shared spinlock. The memory model under test is used together with the validated in-order pipeline model, and we compare the number of cycles taken to execute a benchmark in R2VM and in RTL simulation. The error is around 10% for the cache-coherency model and lower for the non-coherent models. Though not as accurate as the pipeline model, we believe that at this accuracy the simulation can provide representative-enough metrics for exploring design decisions.

4.2 Performance
Figure: Simulation speed comparison, in millions of instructions per second (MIPS): R2VM,atomic 413; R2VM,pipeline 334; QEMU 269; R2VM,pipeline,lockstep 33; R2VM,simple,MESI 28; R2VM,pipeline,MESI 26; gem5,atomic 3; gem5,cycle 0.3; RTL 0.01.
REFERENCES
[1] Oscar Almer, Igor Böhm, Tobias Edler Von Koch, Björn Franke, Stephen Kyle,
Volker Seeker, Christopher Thompson, and Nigel Topham. 2011. Scalable multi-
core simulation using parallel dynamic binary translation. In 2011 International
Conference on Embedded Computer Systems: Architectures, Modeling and Simula-
tion. IEEE, 190–199.
[2] Fabrice Bellard. 2005. QEMU, a fast and portable dynamic translator. In USENIX
Annual Technical Conference, FREENIX Track, Vol. 41. 46.
[3] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. 2008. The
PARSEC benchmark suite: Characterization and architectural implications. In
Proceedings of the 17th international conference on Parallel architectures and com-
pilation techniques. ACM, 72–81.
[4] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K Reinhardt, Ali
Saidi, Arkaprava Basu, Joel Hestness, Derek R Hower, Tushar Krishna, Somayeh
Sardashti, et al. 2011. The gem5 simulator. ACM SIGARCH computer architecture
news 39, 2 (2011), 1–7.
[5] Igor Böhm, Björn Franke, and Nigel Topham. 2010. Cycle-accurate performance
modelling in an ultra-fast just-in-time dynamic binary translation instruction
set simulator. In 2010 International Conference on Embedded Computer Systems:
Architectures, Modeling and Simulation. IEEE, 1–10.
[6] Hadi Brais, Rajshekar Kalayappan, and Preeti Ranjan Panda. 2020. A Survey of
Cache Simulators. ACM Computing Surveys (CSUR) 53, 1 (2020), 1–32.
[7] The Embedded Microprocessor Benchmark Consortium. 2020. CoreMark. https:
//www.eembc.org/coremark/. Accessed: 2020-04-14.
[8] The Standard Performance Evaluation Corporation. 2017. SPEC CPU® 2017.
https://www.spec.org/cpu2017/. Accessed: 2020-04-14.
[9] Emilio G Cota and Luca P Carloni. 2019. Cross-ISA machine instrumentation
using fast and scalable dynamic binary translation. In Proceedings of the 15th ACM
SIGPLAN/SIGOPS International Conference on Virtual Execution Environments. 74–
87.
[10] Xuan Guo and Robert Mullins. 2019. Fast TLB Simulation for RISC-V Systems. In
Third Workshop on Computer Architecture Research with RISC-V.
[11] Kim Hazelwood. 2011. Dynamic binary modification: Tools, techniques, and
applications. Synthesis Lectures on Computer Architecture 6, 2 (2011), 1–81.
[12] Ankur Limaye and Tosiron Adegbija. 2018. A workload characterization of
the SPEC CPU2017 benchmark suite. In 2018 IEEE International Symposium on
Performance Analysis of Systems and Software (ISPASS). IEEE, 149–158.
[13] Nicholas Nethercote and Julian Seward. 2007. Valgrind: a framework for heavy-
weight dynamic binary instrumentation. ACM Sigplan notices 42, 6 (2007), 89–100.
[14] Igor Pavlov. 2019. 7-Zip LZMA Benchmark. https://www.7-cpu.com/. Accessed:
2020-04-14.
[15] Ali Saidi and Andreas Sandberg. [n. d.]. gem5 Virtual Machine Acceleration.
http://www.m5sim.org/wiki/images/c/c3/2012_12_gem5_workshop_kvm.pdf.
[16] Tuan Ta, Lin Cheng, and Christopher Batten. 2018. Simulating Multi-Core RISC-V
Systems in gem5. In Workshop on Computer Architecture Research with RISC-V.
[17] The Rust Team. 2020. Rust Programming Language. https://www.rust-lang.org/.
Accessed: 2020-04-14.
[18] Tran Van Dung, Ittetsu Taniguchi, and Hiroyuki Tomiyama. 2014. Cache simula-
tion for instruction set simulator QEMU. In 2014 IEEE 12th International Conference
on Dependable, Autonomic and Secure Computing. IEEE, 441–446.