
Accelerate Cycle-Level Full-System Simulation of Multi-Core RISC-V Systems with Binary Translation

Xuan Guo ([email protected]) and Robert Mullins ([email protected])
University of Cambridge, Cambridge, UK

CARRV 2020, May 29, 2020, Virtual Workshop. This paper is published under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license. © 2020 Copyright held by the owner/author(s).

ABSTRACT
It has always been challenging to balance the accuracy and performance of instruction set simulators (ISSs). Register-transfer level (RTL) simulators or systems such as gem5 [4] are used to execute programs in a cycle-accurate manner but are often prohibitively slow. In contrast, functional simulators such as QEMU [2] can run large benchmarks to completion in a reasonable time, yet capture few performance metrics and fail to model complex interactions between multiple cores. This paper presents a novel multi-purpose simulator that exploits binary translation to offer fast cycle-level full-system simulations. Its functional simulation mode outperforms QEMU and, if desired, it is possible to switch between functional and timing modes at run-time. Cycle-level simulations of RISC-V multi-core processors are possible at more than 20 MIPS, a useful middle ground in terms of accuracy and performance, with simulation speeds nearly 100 times those of more detailed cycle-accurate models.

1 INTRODUCTION
RISC-V is a free, open, and extensible ISA. With the ongoing ecosystem development of RISC-V and an increasing number of companies and institutions switching to RISC-V for both production and research, RISC-V has become the test bed instruction set of computer architecture research. A key tool when exploring new architectural trade-offs is the instruction-set simulator (ISS). Fast cycle-level simulation allows new ideas to be validated quickly at an appropriate level of abstraction and without the complexities of hardware development. In particular, we focus on the challenge of simulating multi-core RISC-V systems.

The design of a processor can broadly be divided into the design of the core and of the memory subsystem. Characterising the performance of the core pipeline in isolation is often a simpler task than characterising the memory system. While smaller synthetic benchmarks are useful at the core level, larger, more complex and longer-running workloads are often needed to understand the memory system and the potential interactions between cores.

For example, the small synthetic MCU benchmark CoreMark [7] executes on the order of 10⁸ instructions per iteration, while SPEC2017 [8], a larger and more realistic benchmark running real-life applications, requires on the order of 10¹² instructions for a single run [12]. SPEC takes from hours to days to run even on real machines, and it is hardly possible for simulators to run it to completion. It is therefore helpful to have a fast simulator.

Of course, fast simulation involves a trade-off between the fidelity of the model and the speed at which simulations can be completed. Unfortunately, we are currently forced to choose between slow cycle-accurate simulators and fast functional-only simulators such as QEMU. In particular, there is a lack of fast full-system simulators that can accurately model cache-coherent multi-core processors.

In this paper, we present the Rust RISC-V Virtual Machine (R2VM). R2VM is written in the increasingly popular high-level systems programming language Rust [17]. R2VM is released under permissive MIT/Apache-2.0 licenses (available at https://github.com/nbdd0121/r2vm) in the hope of encouraging its adoption and extension by the broader community. To our knowledge, this is the first binary translated simulator that supports cycle-level multi-core simulation. It can accurately model cache coherency protocols and shared caches. Cycle-level simulations are possible at more than 20 MIPS, while functional-only simulations can outperform QEMU and exceed 400 MIPS.

2 BACKGROUND

2.1 Instruction Set Simulators
ISSs can be classified as execution-driven, emulation-driven or trace-driven [6]. We omit a detailed discussion of execution-driven simulators such as Cachegrind [13], which modify programs with binary instrumentation and execute them natively, because they require the host and guest ISA to be identical and do not support full-system simulation. Emulation-driven simulators emulate the program's execution and gather performance metrics on the fly; in contrast, trace-driven simulators run emulation beforehand and gather traces from the program, e.g. branches or memory accesses, and later replay the traces against a specific model. Traces allow ideas to be evaluated quickly without the need to simulate in detail, but cannot easily capture effects that may alter the instructions that are executed, e.g. inter-core interactions or speculative execution [6]. Moreover, the storage space required for traces grows linearly with the length of execution, making trace-driven simulators incapable of simulating large benchmarks. R2VM is an emulation-driven simulator, and the remainder of this paper will focus exclusively on emulation-driven simulators.

Simulators can also be categorised by their level of abstraction. One category is functional simulators, which simulate the effects of instructions without taking microarchitectural details into account. Because less information is needed, aggressive optimisations can be performed, and performance is usually several orders of magnitude faster than that of timing simulators. QEMU falls into this category. It should be noted that while QEMU itself is a purely functional simulator, it can be modified to collect metadata for off-line or on-line cache simulation [18].

The other category is timing simulators. Among timing simulators, RTL simulators can model processor microarchitectures very precisely, but the difficulty of implementing a feature in an RTL simulator is not much different from implementing it in hardware directly. RTL simulators are also poor in performance, usually running on the order of kIPS [16].

At a higher level, there are cycle-level microarchitectural simulators. These are able to omit RTL implementation details to improve performance while retaining a detailed microarchitectural model. A popular example is the gem5 simulator running in its In-Order or O3 mode [4]. For faster performance, we can give up some further microarchitectural detail and predict the number of cycles taken for each non-memory instruction instead of computing it in real time; in the extreme case, we can assume every non-memory operation takes just 1 cycle to execute, as gem5's "timing simple" CPU model does. This approach is no longer cycle-accurate, but such a cycle-approximate model is often adequate for cache and memory simulations.

2.2 Binary Translation
Binary translation is a technique that accelerates instruction set architecture (ISA) simulation or program instrumentation [11]. An interpreter will fetch, decode and execute the instruction pointed to by the current program counter (PC) one at a time, while binary translation will, either ahead of time (static binary translation) or at runtime, i.e. when a block of code is first executed (dynamic binary translation, DBT), translate one or more basic blocks from the simulated ISA to the host's native code, cache the result, and reuse the translation the next time the same block is executed. QEMU uses binary translation for cross-ISA simulation or when there is no hardware virtualisation support [2]. Böhm et al. proposed a method to introduce binary translation to single-core timing simulation in 2010 [5].
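To make the scheme concrete, the dispatch loop of a dynamic binary translator can be sketched in Rust as follows. The types and names here are illustrative only and do not reflect R2VM's actual internals.

use std::collections::HashMap;

// A translated basic block: takes the CPU state and returns the
// guest PC of the next block to execute.
type Block = Box<dyn Fn(&mut CpuState) -> u64>;

struct CpuState {
    pc: u64,
    regs: [u64; 32],
}

struct Dbt {
    // Code cache: guest PC of a block entry -> translated host code.
    code_cache: HashMap<u64, Block>,
}

impl Dbt {
    fn run(&mut self, state: &mut CpuState) {
        loop {
            let pc = state.pc;
            // Look up the code cache; translate the block on a miss.
            let block = self
                .code_cache
                .entry(pc)
                .or_insert_with(|| translate_block(pc));
            // Execute the cached translation; it returns the next PC.
            state.pc = block(state);
        }
    }
}

// Fetch, decode and translate one basic block starting at `pc`.
fn translate_block(pc: u64) -> Block {
    unimplemented!("fetch and decode guest instructions at {pc:#x}")
}

Optimisations such as block chaining, mentioned below, short-circuit this loop by jumping directly from one translated block to the next.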

2.3 Multi-core Simulation
Extending single-core simulators to handle multiple cores is complicated by the performance implications of the ways in which cores may interact. As cores share caches and memory, simulations of individual cores cannot simply be run independently. For example, accurate modelling of cache coherence, atomic memory operations and inter-processor interrupts (IPIs) must be considered.

Böhm et al.'s modified ARCSim simulator [5] can model single-core processors with high accuracy and reasonable performance; however, Almer et al.'s extension of this work [1], which essentially runs multiple copies of the single-core simulator in parallel threads to provide multi-core support, is limited in its fidelity. The authors comment that the "detailed behaviour of the shared second-level cache, processor interconnect and external memory of the simulated multi-core platform" cannot be modelled accurately. QEMU is able to exploit multiple host cores to emulate a multi-core guest, but provides only a functional simulation mode and supports no timing or modelling of the memory system.

An accurate model of cache coherence and the memory hierarchy requires that multiple cores are simulated in lockstep (or in a way that guarantees equivalent results). Simulators that forego this are unable to properly simulate race conditions and shared resources. Existing cycle-level simulators such as gem5 achieve lockstep by iterating through all simulated cores each cycle, which causes a significant performance drop. Spike (or riscv-isa-sim), on the other hand, switches the active core less frequently; its default compilation option switches the core only every 1000 cycles, making it impossible to model race conditions where all cores are trying to acquire a lock simultaneously. No existing binary translated simulator can model multi-core interaction in lockstep, and therefore none can properly model cache coherency or a shared second-level cache.

3 IMPLEMENTATION

3.1 Overview
The high-level control flow of R2VM, as shown in Figure 1, is similar to that of other binary translators. When an instruction at a particular PC is to be executed, the code cache is looked up and the cached translated binary is executed directly if found; otherwise, the binary translator is invoked and an entire basic block is fetched, decoded and translated. We use a variety of techniques to improve binary translator performance that are often found in other binary translators, such as block chaining [2].

As full-system simulation is supported, we have to deal with the case where a 4-byte uncompressed instruction spans two pages. We handle this by creating a stub that reads the 2 bytes that lie on the second page each time the stub is executed, and patches the generated code if the 2 bytes read differ from those of the initial translation.

Cota et al. [9] suggest sharing a code cache between multiple cores to promote code reuse and boost performance. In contrast, we provide each hardware thread with its own code cache. This allows different code to be generated for each core, e.g. in the case of heterogeneous cores, and also lessens the synchronisation requirements when modifying the code cache, simplifying the implementation.
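As a side note, deciding whether an instruction needs the page-spanning stub described above reduces to a simple offset check; a minimal sketch, assuming 4 KiB pages and the 2-byte instruction alignment guaranteed by the RISC-V compressed extension:

const PAGE_SIZE: u64 = 4096;

// A 4-byte uncompressed instruction has its last 2 bytes on the
// following page exactly when its page offset is PAGE_SIZE - 2
// (RISC-V instructions are at least 2-byte aligned).
fn spans_two_pages(vaddr: u64) -> bool {
    vaddr % PAGE_SIZE == PAGE_SIZE - 2
}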

Figure 1: Control flow overview of the simulator. On a code cache miss, the binary translator fetches, decodes and translates instructions until the end of the basic block, invoking the pipeline model's block begin, before-instruction, after-instruction and after-taken-branch hooks during code generation; completed translations are placed in the code cache and executed directly on subsequent hits.


3.2 Pipeline Simulation
The main difference between our simulator's flow and existing ones such as QEMU is that we introduce "pipeline models", which comprise several hooks. Hooks can process relevant instructions and generate essential microarchitectural simulation code if necessary. The hooks can also indicate the number of cycles an instruction takes to complete. It should be noted that this covers only the execution pipeline; the memory system and caches are handled by a separate component.

For simple models, such as gem5's "timing simple" model where each instruction takes 1 cycle to execute, the implementation is straightforward, as shown in Listing 1.

#[derive(Default)]
pub struct SimpleModel;

impl PipelineModel for SimpleModel {
    fn after_instruction(&mut self, compiler: &mut DbtCompiler, _op: &Op, _compressed: bool) {
        compiler.insert_cycle_count(1);
    }

    fn after_taken_branch(&mut self, compiler: &mut DbtCompiler, _op: &Op, _compressed: bool) {
        compiler.insert_cycle_count(1);
    }
}

Listing 1: Timing simple model implementation

We have also implemented and validated an in-order pipeline model that accurately models a classic 5-stage pipeline with a static branch predictor. Our implementation captures pipeline hazards, such as data hazards caused by load-use dependencies and stalls due to a branch/jump into a misaligned 4-byte instruction. Unlike Böhm et al.'s simulator [5], which needs to call a "pipeline" function after each instruction, our implementation models pipeline behaviour during DBT code generation and reflects it in the number of cycles taken, and therefore requires no explicit code to be executed at runtime. More complex processors may need either to estimate pipeline state (and sacrifice some accuracy) or to generate custom assembly in the hooks to maintain this state during execution (and sacrifice some performance).
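To illustrate what a more detailed model's hooks might track, the sketch below charges an extra stall cycle for a load-use hazard against the PipelineModel interface as it appears in Listing 1. The Op accessors (is_load, rd, uses_reg) and the exact penalties are hypothetical; R2VM's real in-order model is considerably more elaborate.

// Sketch of a load-use hazard check in an after_instruction hook.
#[derive(Default)]
pub struct InOrderSketch {
    // Destination register of the previous instruction, if it was a load.
    pending_load_rd: Option<u8>,
}

impl PipelineModel for InOrderSketch {
    fn after_instruction(&mut self, compiler: &mut DbtCompiler, op: &Op, _compressed: bool) {
        let mut cycles = 1;
        // Load-use hazard: one stall cycle if this instruction reads
        // the register loaded by the immediately preceding instruction.
        if let Some(rd) = self.pending_load_rd {
            if op.uses_reg(rd) {
                cycles += 1;
            }
        }
        self.pending_load_rd = if op.is_load() { Some(op.rd()) } else { None };
        compiler.insert_cycle_count(cycles);
    }

    fn after_taken_branch(&mut self, compiler: &mut DbtCompiler, _op: &Op, _compressed: bool) {
        // A taken branch flushes the fetch stages of the 5-stage
        // pipeline; the refill penalty here is illustrative.
        self.pending_load_rd = None;
        compiler.insert_cycle_count(2);
    }
}

Because the hooks run at translation time, all of this bookkeeping is resolved while generating code, in line with the design described above.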

3.3 Multi-core Simulation
The techniques described in the previous section work well for single-core systems. But as described in the background section, running multiple cores in parallel or switching between them in a coarse-grained manner has a huge impact on the simulation accuracy of multi-threaded programs. The ideal scheduling granularity is therefore a cycle, i.e. having all simulated cores run in lockstep. This is, however, difficult to achieve for binary translators.

We experimented with the idea of using thread barriers to synchronise multiple threads, each simulating a single core. It turns out we could only synchronise 1 million times per second, even after careful optimisation at the assembly level.

3.3.1 Lockstep Simulation. The approach we use takes inspiration from fibers, sometimes also referred to as coroutines or green threads. Fibers are cooperatively scheduled by the user-space application and voluntarily "yield" to other fibers, in contrast to traditional threads, which are preemptively scheduled by the operating system and are generally heavy-weight constructs. Fibers are often used in I/O-heavy, highly concurrent workloads such as network programming, but here we borrow them for our simulator.

In our implementation, we create one fiber for each simulated hardware thread, plus a fiber for the event loop. Each time the pipeline model instructs the DBT to wait for a number of cycles, we generate that number of yields. Listing 2 shows an example of code generated under the timing simple model.

mov rax, qword [rbp+0x78]   ; \
mov qword [rbp+0x70], rax   ; | add a4, zero, a5
call fiber_yield_raw        ; /
mov eax, dword [rbp+0x78]   ; \
add eax, -0x1               ; |
cdqe                        ; | addiw a5, a5, -1
mov qword [rbp+0x78], rax   ; |
call fiber_yield_raw        ; /
mov eax, dword [rbp+0x70]   ; \
imul eax, dword [rbp+0x50]  ; |
cdqe                        ; | mulw a0, a4, a0
mov qword [rbp+0x50], rax   ; |
call fiber_yield_raw        ; /

Listing 2: Example of generated code with yield calls. RBP points to the array of RISC-V registers.

Unlike a normal fiber implementation, we engineered the fiber's memory layout, shown in Figure 2, to suit the needs of a simulator. Each fiber is allocated a 2 MiB memory region aligned to a 2 MiB boundary, and the stack for code running under the fiber is contained within this region. The alignment requirement allows the fiber's start address to be recovered from the stack pointer by simply masking out the least significant 21 bits.
Figure 2: Memory layout of fibers. Each fiber's 2 MiB region holds its stack and fixed structures: the event-loop fiber holds the events priority queue, the next event and the cycle number, while each core fiber holds the L0 address translation cache, core state and registers. The next fiber's stack pointer is stored at a negative offset from the base pointer.


The base pointer points to the end of the fixed fiber structures rather than the beginning, so that positive offsets from the base pointer can be used freely by the DBT-ed code, while negative offsets are used for fiber management.

The ABI of the host platform is not respected for DBT-ed code; rather, we specify all registers other than the base pointer and stack pointer to be volatile, i.e. caller-saved. By doing so, fiber_yield_raw does not need to bear the cost of saving any registers. To yield in non-DBT-ed code, we can alternatively push the ABI-specified callee-saved registers onto the stack and then switch.

This careful design makes fiber switching lightning fast; the fiber_yield_raw function is as simple as 4 instructions on AMD64, as shown in Listing 3.

fiber_yield_raw:
    mov [rbp - 32], rsp   ; Save current stack pointer
    mov rbp, [rbp - 16]   ; Move to next fiber
    mov rsp, [rbp - 32]   ; Restore stack pointer
    ret

Listing 3: Implementation of the fiber yielding code
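A minimal sketch of the allocation trick behind this design, using std::alloc directly; R2VM's actual fiber setup also initialises the management fields shown in Figure 2.

use std::alloc::{alloc, Layout};

// 2 MiB per fiber, aligned to a 2 MiB boundary.
const FIBER_SIZE: usize = 2 << 20;

// Allocate one fiber's memory region; the stack and the fiber
// structures both live inside it.
fn alloc_fiber() -> *mut u8 {
    let layout = Layout::from_size_align(FIBER_SIZE, FIBER_SIZE).unwrap();
    unsafe { alloc(layout) }
}

// Recover the fiber's start address from any stack pointer within
// the region by masking out the least significant 21 bits.
fn fiber_base(stack_pointer: usize) -> usize {
    stack_pointer & !(FIBER_SIZE - 1)
}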
3.3.2 Synchronisation Points. Simply yielding a few cycles after every executed instruction would severely limit performance and in many cases be unnecessary. In practice, we only need to synchronise at points where the execution pipeline can produce side-effects visible to other cores and/or the rest of the system, or where the rest of the system's behaviour could affect the running pipeline.

We observe that there are three ways that one pipeline interacts with another:
• A memory operation is performed.
• A control register operation is performed. This includes reads/writes to performance monitor registers, or to control registers related to the memory system.
• An interrupt happens.
For the first two types of interaction, we insert a synchronisation point before and after they are executed. For the third case (interrupts), because it is generally difficult to interrupt DBT-ed code mid-way, we choose to check for interrupts only at the end of basic blocks. We believe that this decision does not affect the accuracy of our simulation, due to the inherent entropy of I/O operations.

A yield that lies between two synchronisation points therefore has no visible side-effects and cannot be distinguished from one at either end, so our implementation postpones all yielding until the next synchronisation point. We also tweaked the yield implementation shown in Listing 3 slightly to allow multi-cycle yields, which demonstrates around a 10% performance gain compared to naive yielding.
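In code-generation terms, this batching can be sketched as follows. DbtCompiler is the code generator from Listing 1, but emit_yield_cycles is a hypothetical stand-in for R2VM's actual multi-cycle yield emission.

// Sketch of postponing yields to the next synchronisation point.
#[derive(Default)]
struct PendingYields {
    cycles: u32,
}

impl PendingYields {
    // Called when the pipeline model reports cycles for an instruction.
    fn add(&mut self, cycles: u32) {
        self.cycles += cycles;
    }

    // Called at a synchronisation point (memory access, control
    // register access, or end of basic block): emit a single
    // multi-cycle yield instead of one yield per cycle.
    fn flush(&mut self, compiler: &mut DbtCompiler) {
        if self.cycles > 0 {
            compiler.emit_yield_cycles(self.cycles);
            self.cycles = 0;
        }
    }
}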
3.4 Memory Simulation
The previous sections described how we simulate each core's processing pipeline and how we achieve simulation in lockstep. The techniques we described and implemented speed up pipeline simulation, but the speedup would be very limited if every memory access were still fully simulated. Moreover, the instruction cache and translation lookaside buffers (TLBs) would also need to be simulated for accuracy.

3.4.1 L0 Data Cache. For memory operations, each running core has its own "L0 data cache". When a core needs to read from or write to a memory address, it first checks whether the address is in the L0 data cache. If it hits, the memory access is performed entirely within DBT-ed code, bypassing the memory model entirely.

As a result, the memory model does not intercept all memory accesses, so it is important to control what can be in the L0 data cache. We maintain the property that if an access hits the L0 data cache, it would also be a cache hit if the access reached the memory model. We sped up TLB simulation with a similar approach in our previous work [10].

In our previous TLB simulation work, the property mandates an invariant that all entries in the L0 TLB are in the L1 data TLB. The invariant kept in R2VM is that all L0 data cache entries are contained both in the L1 data TLB and in the L1 data cache. Therefore, as shown in Figure 3, when entries are evicted from either the simulated TLB or the cache model, corresponding entries need to be flushed from the L0 data cache to preserve this inclusiveness property.

Figure 3: Control flow for memory accesses. DBT-ed code indexes into the L0 data cache and checks the tag; on a hit, the target address is obtained with an XOR and the access is performed directly. On a miss, the memory model is invoked: it walks the page tables and updates the simulated TLB, checks permissions, updates the simulated cache, flushes L0 entries on TLB/cache evictions, and either inserts a new entry into the L0 data cache or triggers a page fault exception.


We carefully engineered the memory layout of L0 data cache entries for maximum efficiency. The L0 data cache is direct-mapped, with each entry representing a cache line. Each entry has the memory layout shown in Figure 4. It does not store actual memory contents; rather, it stores a translation from the virtual tag to a physical address. In a sense, it is more like a TLB with cache-line granularity than a cache. We pack the virtual tag together with a bit indicating whether the cache line is read-only into one machine word (T), and the XOR of the guest physical address and the corresponding guest virtual address into another (A).

Figure 4: Memory layout of a tag entry in the L0 data cache. Word T holds vtag in bits 63-1 and the read-only bit RO in bit 0; word A holds paddr ⊕ vaddr.

For each memory access, the L0 data cache is indexed using the virtual tag. For a read access, we check whether T >> 1 equals vtag; for a write access, we check whether vtag << 1 equals T, which additionally requires the read-only bit to be clear. If the check passes, the requested virtual address is XOR-ed with A to produce the address to access, directly within DBT-ed code. If the check fails, the cold path is taken and the memory model is invoked. The memory model simulates both the TLB and the data cache, and either triggers a page fault or inserts an entry into the L0 data cache.
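The hot-path check that the generated code performs is equivalent to the following sketch; the surrounding types are illustrative, as the real check is emitted directly as a few host instructions.

const CACHE_LINE_BITS: u32 = 6; // 64-byte cache lines

// One L0 data cache entry: `tag` packs the virtual tag (bits 63..1)
// and a read-only flag (bit 0); `xor` holds paddr ^ vaddr.
#[derive(Clone, Copy)]
struct L0Entry {
    tag: u64,
    xor: u64,
}

// Returns the physical address on a hit, or None for the cold path.
fn lookup(l0: &[L0Entry], vaddr: u64, write: bool) -> Option<u64> {
    let vtag = vaddr >> CACHE_LINE_BITS;
    let entry = l0[(vtag as usize) % l0.len()];
    let hit = if write {
        // A write hits only if the tag matches and RO (bit 0) is clear.
        vtag << 1 == entry.tag
    } else {
        entry.tag >> 1 == vtag
    };
    if hit {
        // XOR-ing the virtual address with paddr ^ vaddr yields paddr.
        Some(vaddr ^ entry.xor)
    } else {
        None // invoke the memory model
    }
}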
The existence of the L0 data cache underpins the performance of R2VM's fast path, because it requires only 3 host memory operations for each memory operation simulated. In the default configuration, because the memory model does not intercept all memory accesses, replacement policies such as least-recently used (LRU) cannot be used for the simulated TLB and cache. Generally, we believe this is an acceptable accuracy loss in exchange for vastly better simulation performance. If LRU-like policies are really needed, the L0 data cache can be bypassed and the memory model invoked for each memory access, at a cost in performance.

3.4.2 L0 Instruction Cache. R2VM also simulates the instruction TLB and cache, similarly to the data side. Each core has its own L0 instruction cache, with a simpler entry layout because read and write permissions need not be distinguished. To keep the overhead of simulating the instruction cache down, instead of accessing it each time an instruction is executed, we do so only when a basic block begins, or when the instruction being translated is in a different cache line from the previous instruction. For a cache line size of 64 bytes, this means that we only need to generate a single L0 instruction cache access for every 16-32 instructions (as RISC-V instructions are 4 or 2 bytes long).

We also creatively use the L0 instruction cache to optimise jumps across pages. Traditionally, because the page mapping, and therefore the actual target of a jump instruction, might change, DBTs have to be conservative and not link such blocks together. We instead consult the L0 instruction cache (which we would need to check anyway when the next block begins) to see whether the target is the same as the cached target. If so, the cached target is used and control does not go back to the main loop.

3.4.3 Cache Coherency. Our design for the memory system inherently supports cache coherency. Whenever the cache coherency protocol requires an invalidation, the line can be flushed from the L0 data cache of the target core. Because all simulated cores execute in lockstep, and there are synchronisation points before all memory accesses, the effect of the invalidation will be visible before the next memory access.

3.5 Runtime Reconfiguration
R2VM is capable of user-level, supervisor-level and machine-level simulation. For user-level simulation, Linux syscalls are emulated, and for supervisor-level simulation, supervisor binary interface (SBI) calls are emulated.

In many cases, we want to gather cache statistics with the behaviour of the operating system (OS) taken into account, but we do not want to count the OS boot and workload preparation steps before the region of interest, nor pay the performance overhead of detailed models for these portions. The design of R2VM takes this into account: both the pipeline and memory models can be switched dynamically at runtime. The switching is controlled by writing a special control and status register (CSR) in the vendor-specific CSR range.

R2VM supports pipeline model switching by simply flushing the code cache of translated binaries and letting the DBT engine use the new model's hooks for code generation. Moreover, since, as mentioned in Section 3.1, each core has its own code cache for DBT-ed code, we allow the pipeline model to be specified per core rather than globally.

The memory model is switched at runtime by flushing the L0 data cache and the instruction cache. The cache line size is also a runtime-configurable property. For example, if both the TLB and the cache are simulated, the cache line size can be set to 64 bytes; if only the TLB is simulated, it can be set to 4096 bytes, effectively turning the L0 data cache into an L0 data TLB.

If the memory model permits, R2VM can also switch between lockstep execution and parallel execution at runtime, like other binary translators. Parallel execution is enabled on the "atomic" memory model. When paired with the "atomic" pipeline model, this behaves functionally equivalently to QEMU and gem5's atomic model, which permits fast-forwarding of the aforementioned boot and preparation steps.
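For illustration, a guest-side model switch could look like the sketch below. The CSR address 0x8C0 and the model encoding are hypothetical placeholders, as the paper does not name the actual vendor-specific CSR that R2VM uses.

// Guest-side sketch, for a RISC-V target.
#[cfg(target_arch = "riscv64")]
unsafe fn switch_pipeline_model(model_id: usize) {
    // Write the (hypothetical) vendor-specific CSR that triggers
    // the simulator to flush code caches and swap models.
    core::arch::asm!("csrw 0x8c0, {0}", in(reg) model_id);
}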
4 EVALUATION
As described in Section 3, R2VM offers a range of pipeline models and memory models to select from, and allows switching between them mid-simulation. Each model offers different trade-offs. The pre-implemented pipeline and memory models are listed in Table 1 and Table 2.

Name     Description
Atomic   Cycle count not tracked
Simple   Each non-memory instruction takes one cycle
InOrder  Models a simple 5-stage in-order scalar pipeline

Table 1: List of pre-implemented pipeline models

Name     Description
Atomic   Memory accesses not tracked
TLB      TLB hit rate collected; cache not simulated
Cache    Cache hit rate collected; TLB and cache coherency not modelled; parallel execution allowed
MESI     A directory-based MESI cache coherency protocol with a shared L2; lockstep execution required

Table 2: List of pre-implemented memory models
4.1 Accuracy and Validation
For pipeline models, we validated the accuracy of the in-order model against an actual RTL implementation of a RISC-V core using CoreMark [7]. CoreMark is particularly helpful for this validation, as its working set is small enough to fit into caches, so the memory system of the RTL implementation does not affect the benchmark result. In our run, the RTL implementation reports 2.10 CoreMark/MHz, whereas the in-order model, paired with the atomic memory model, reports 2.09 CoreMark/MHz; the difference is less than 1%. The "simple" model is validated simply by checking that all cores have equal MCYCLE and MINSTRET CSRs.

For memory models, we used a few micro-benchmarks to cover the use case of each model. For TLB and cache simulation, we used a single-core micro-benchmark similar to the MemLat tool from the 7-Zip LZMA benchmark [14]. For the MESI cache-coherency model, we used a micro-benchmark that simulates a scenario where two cores heavily contend on a shared spinlock. The memory model under test is used together with the validated in-order pipeline model, and we compare the number of cycles taken to execute a benchmark in R2VM and in RTL simulation. The error is around 10% for the cache coherency model and lower for the non-coherent models. Though not as accurate as the pipeline model, we believe that at this accuracy the simulation can provide representative-enough metrics for exploring design decisions.

4.2 Performance
We evaluated the performance of R2VM against QEMU using the deduplication workload from PARSEC [3] on 4 cores to test the integer performance of the simulator (both R2VM and QEMU interpret floating-point operations). The numbers for the gem5 simulator are from Saidi et al.'s presentation [15].

Figure 5: Performance comparison between models and other simulators, in millions of instructions per CPU second:

R2VM,atomic             413
R2VM,pipeline           334
QEMU                    269
R2VM,pipeline,lockstep   33
R2VM,simple,MESI         28
R2VM,pipeline,MESI       26
gem5,atomic               3
gem5,cycle                0.3
RTL                       0.01

As shown in Figure 5, the techniques we use lead to superb performance. When caches are not simulated and cores can therefore run in parallel threads, R2VM runs at over 300 MIPS per core, even outperforming QEMU. Lockstep execution brings performance down by about 10x to around 30 MIPS (for 4 simulated cores in a single thread), but this is still significantly faster than gem5.

Thanks to our pipeline model design, which moves most simulation work to DBT compilation time rather than runtime, and to our memory model design, which offloads most memory accesses to the L0 caches, simulating pipelines and cache coherency protocols adds little overhead of its own compared to the overhead of lockstep execution itself.

5 CONCLUSION
We have introduced R2VM, a multi-purpose binary translating simulator that is able to simulate multi-core RISC-V systems at the cycle level at high speed. This is achieved by leveraging fibers to support fast lockstep execution. Overall, our optimisations make it possible for R2VM to achieve functional simulation performance exceeding that of QEMU, and cycle-level simulation nearly 100 times faster than gem5.

REFERENCES
[1] Oscar Almer, Igor Böhm, Tobias Edler von Koch, Björn Franke, Stephen Kyle, Volker Seeker, Christopher Thompson, and Nigel Topham. 2011. Scalable multi-core simulation using parallel dynamic binary translation. In 2011 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation. IEEE, 190–199.
[2] Fabrice Bellard. 2005. QEMU, a fast and portable dynamic translator. In USENIX Annual Technical Conference, FREENIX Track, Vol. 41. 46.
[3] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. 2008. The PARSEC benchmark suite: Characterization and architectural implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques. ACM, 72–81.
[4] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, et al. 2011. The gem5 simulator. ACM SIGARCH Computer Architecture News 39, 2 (2011), 1–7.
[5] Igor Böhm, Björn Franke, and Nigel Topham. 2010. Cycle-accurate performance modelling in an ultra-fast just-in-time dynamic binary translation instruction set simulator. In 2010 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation. IEEE, 1–10.
[6] Hadi Brais, Rajshekar Kalayappan, and Preeti Ranjan Panda. 2020. A Survey of Cache Simulators. ACM Computing Surveys (CSUR) 53, 1 (2020), 1–32.
[7] The Embedded Microprocessor Benchmark Consortium. 2020. CoreMark. https://www.eembc.org/coremark/. Accessed: 2020-04-14.
[8] The Standard Performance Evaluation Corporation. 2017. SPEC CPU® 2017. https://www.spec.org/cpu2017/. Accessed: 2020-04-14.
[9] Emilio G. Cota and Luca P. Carloni. 2019. Cross-ISA machine instrumentation using fast and scalable dynamic binary translation. In Proceedings of the 15th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments. 74–87.
[10] Xuan Guo and Robert Mullins. 2019. Fast TLB Simulation for RISC-V Systems. In Third Workshop on Computer Architecture Research with RISC-V.
[11] Kim Hazelwood. 2011. Dynamic Binary Modification: Tools, Techniques, and Applications. Synthesis Lectures on Computer Architecture 6, 2 (2011), 1–81.
[12] Ankur Limaye and Tosiron Adegbija. 2018. A workload characterization of the SPEC CPU2017 benchmark suite. In 2018 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 149–158.
[13] Nicholas Nethercote and Julian Seward. 2007. Valgrind: a framework for heavyweight dynamic binary instrumentation. ACM SIGPLAN Notices 42, 6 (2007), 89–100.
[14] Igor Pavlov. 2019. 7-Zip LZMA Benchmark. https://www.7-cpu.com/. Accessed: 2020-04-14.
[15] Ali Saidi and Andreas Sandberg. [n. d.]. gem5 Virtual Machine Acceleration. http://www.m5sim.org/wiki/images/c/c3/2012_12_gem5_workshop_kvm.pdf.
[16] Tuan Ta, Lin Cheng, and Christopher Batten. 2018. Simulating Multi-Core RISC-V Systems in gem5. In Workshop on Computer Architecture Research with RISC-V.
[17] The Rust Team. 2020. Rust Programming Language. https://www.rust-lang.org/. Accessed: 2020-04-14.
[18] Tran Van Dung, Ittetsu Taniguchi, and Hiroyuki Tomiyama. 2014. Cache simulation for instruction set simulator QEMU. In 2014 IEEE 12th International Conference on Dependable, Autonomic and Secure Computing. IEEE, 441–446.
