key innovation in our system that makes composing multiple analyses reliable. With these guarantees, probes from multiple monitors do not interfere, making monitors composable and deterministic. Monitors can be used in any combination without explicit foresight in their implementation.

2.4.1 Deterministic firing order. What should happen if a probe 𝑝 at location 𝐿 fires and inserts another probe 𝑞 at the same location 𝐿? Should the new probe 𝑞 also fire before returning to the program, or not? Similarly, if probes 𝑝 and 𝑞 are inserted on the same event, is their firing order predictable?

We found that a guaranteed probe firing order is subtly important to the correctness of some monitors (e.g. the function entry/exit utility shown in Section 2.5). For this reason, we guarantee three dynamic probe consistency properties:

• Insertion order is firing order: Probes inserted on the same event 𝐸 fire in the same order as they were inserted.
• Deferred inserts on same event: When a probe fires on event 𝐸 and inserts new probes on the same 𝐸, the new probes do not fire until the next occurrence of 𝐸.
• Deferred removal on same event: When a probe fires on event 𝐸 and removes probes on the same 𝐸, the removed probes do fire on this occurrence of 𝐸 but not subsequent occurrences.
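To make these properties concrete, here is a minimal sketch, in Rust, of one way an engine can realize them; the Event, Probe, and Edits types are illustrative stand-ins, not Wizard's actual (Virgil) API. Edits requested while an event is firing are buffered and applied only after the firing sequence completes.

struct Probe {
    id: u32,
    fire: Box<dyn FnMut(&mut Edits)>,
}

#[derive(Default)]
struct Edits {
    insert: Vec<Probe>, // deferred: these first fire on the next occurrence
    remove: Vec<u32>,   // deferred: applied after this occurrence completes
}

struct Event {
    probes: Vec<Probe>, // kept in insertion order
}

impl Event {
    fn fire(&mut self) {
        let mut edits = Edits::default();
        // Snapshot the length: probes inserted during this firing do not
        // fire until the next occurrence (deferred inserts on same event).
        let n = self.probes.len();
        for i in 0..n {
            // Insertion order is firing order.
            (self.probes[i].fire)(&mut edits);
        }
        // Deferred removal: removed probes already fired above; they are
        // gone only for subsequent occurrences.
        self.probes.retain(|p| !edits.remove.contains(&p.id));
        self.probes.extend(edits.insert);
    }
}

fn main() {
    let p0 = Probe {
        id: 0,
        fire: Box::new(|edits: &mut Edits| {
            println!("probe 0 fires, inserts probe 1, removes itself");
            edits.insert.push(Probe {
                id: 1,
                fire: Box::new(|_: &mut Edits| println!("probe 1 fires")),
            });
            edits.remove.push(0);
        }),
    };
    let mut ev = Event { probes: vec![p0] };
    ev.fire(); // only probe 0 fires
    ev.fire(); // only probe 1 fires
}

The length snapshot yields the deferred-insert property, and applying removals only after the loop yields deferred removal, while the ordered list preserves insertion order.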
2.4.2 Frame modifications. As shown, the FrameAccessor provides a mostly read-only interface to program state. Since monitors run in the engine's state space, and not the Wasm program's state space, by construction this guarantees that monitors do not alter the program behavior. However, some monitors, such as a debugger's fix-and-continue operation, or fault injection, intentionally change program state.

For an interpreter, modifications to program state, such as local variables, require no special support, since interpreters typically do not make assumptions across bytecode boundaries. For JIT-compiled code, any assumption about program state could potentially be violated by M-code frame modifications. Depending on the specific circumstance, continuing to run JITed code after state changes might exhibit unpredictable program behavior9.

9 True even for baseline compilers like Wizard's compiler, which perform limited optimizations like register allocation and constant propagation.

It's important for the engine to provide a consistency model for state changes made through the FrameAccessor. When monitors explicitly intend to alter the program's behavior, it is natural for them to expect state changes to take effect immediately, as if the program is running in an interpreter. Thus, our system guarantees:

• Frame modification consistency: State changes made by a probe are immediately applied, and execution after a probe resumes with those changes.

This effectively requires immediate deoptimization of a frame, also guaranteed by JVMTI. Otherwise, if execution continues in JIT-compiled code, almost any invariant the JIT relied on could be invalid, and it may appear that updates have not occurred yet, violating consistency.

2.4.3 Multi-threading. While Wizard is not currently multi-threaded, WebAssembly does have proposals to add threading capabilities, which Wizard must eventually support. That brings with it the possibility of multi-threaded instrumentation. Locks around insertion and removal of probes should maintain our consistency guarantees by serializing dynamic instrumentation requests. Our design inherently separates monitor state from program state. Thus data races on the monitor state are the responsibility of the monitors, for example by using lock-free data structures and/or locks at the appropriate granularity. The FrameAccessor can also include synchronization to prevent data races on Wasm state10.

10 Note: frames are by definition thread-local; races can only exist if the monitor itself is multi-threaded and FrameAccessor objects are shared racily.
2.5 Function Entry/Exit Probes

Probes are a low-level, instruction-based instrumentation mechanism, which is natural and precise when interfacing with a VM. Yet many analyses focus on function-level behavior and are interested in calls and returns. Instrumentation hooks for function entry/exit make such analyses much easier to write.

At first glance, detecting function entry can be done by probing the first bytecode of a function, and exit can be detected by probing all returns, throws, and brs that target the function's outermost block. However, some special cases make this tricky. First, a function may begin with a loop; the entry probe must distinguish between the first entry to a function, a backedge of the loop, and possible (tail-)recursive calls. Second, local exits are not enough: frames can be unwound by a callee throwing an exception caught higher in the callstack.

Should the VM support function entry/exit as special hooks for probes? Interestingly, we find this is not strictly necessary. This functionality can be built from the programmability of local probes and offered as a library. There are several possible implementation strategies: 1) use entry probes that push FrameAccessor objects onto an internal stack, with exit probes popping; 2) sample the stack depth via the FrameAccessor's depth() method; or 3) instrument, and thus ignore, loop backedges (a sketch combining the first two follows below). Thus, function entry/exit resides above global/local probes in the hierarchy of instrumentation mechanisms. This is further evidence that the programmability of probes allows building higher-level instrumentation utilities for more expressive dynamic analyses.
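As a sketch of how such a library could combine strategies 1) and 2), the following Rust stand-in keeps a shadow stack of frame depths; depth plays the role of the FrameAccessor's depth() method, and all other names are illustrative assumptions:

struct EntryExit {
    shadow: Vec<u32>, // one entry per live frame we have announced
}

impl EntryExit {
    // Probe on a function's first instruction. The frame depth tells a
    // genuine entry, including recursion (strictly deeper), apart from a
    // loop backedge that re-reaches the first instruction (same depth).
    fn on_first_instr(&mut self, depth: u32, func: &str) {
        if self.shadow.last() != Some(&depth) {
            self.shadow.push(depth);
            println!("enter {func} (depth {depth})");
        }
    }

    // Probe on returns. When an exception unwinds several frames at
    // once, every shadow entry at or deeper than the exiting depth pops.
    fn on_exit(&mut self, depth: u32) {
        while self.shadow.last().map_or(false, |&d| d >= depth) {
            self.shadow.pop();
            println!("exit (depth {depth})");
        }
    }
}

fn main() {
    let mut ee = EntryExit { shadow: Vec::new() };
    ee.on_first_instr(1, "main");
    ee.on_first_instr(2, "fib"); // deeper: a real (possibly recursive) entry
    ee.on_first_instr(2, "fib"); // same depth: loop backedge, ignored
    ee.on_exit(2);               // fib returns
    ee.on_exit(1);               // main returns
}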
2.6 After-instruction

Some analyses, such as branch profiling or dynamic call graph construction, are naturally expressed as M-code that should run after an instruction rather than before. For example, profiling which functions are targets of a call_indirect would be easiest if a probe could fire after the instruction is executed and a frame for the target function has been pushed onto the execution stack. However, the API has no such functionality.

Should the VM support an "after-instruction" hook directly? Interestingly, we find that, like function entry/exit, the unlimited programmability of probes allows us to invoke M-code seemingly after instructions. For example, suppose we want to execute probe 𝑝 after a br_table (i.e. Wasm's switch instruction). We identified at least three strategies:

• A probe 𝑞𝑝 executed before the br_table can use the FrameAccessor object to read the top (i32 value) of the operand stack, determine where the branch will go, and dynamically insert probe 𝑝 at that location (sketched below).
• Insert probes into all targets of the br_table. Since br_table has a fixed set of targets, we can insert probes once and use M-state to distinguish reaching each target from the br_table versus another path. This only works in limited circumstances; other instructions like call_indirect have an unlimited set of targets.
• Insert a global probe for just one instruction and remove it after. The probe will fire on the next instruction, wherever that is, then the probe will remove itself. For a use case like this, it's important that dynamically enabling global probes doesn't ruin performance, e.g. by deoptimizing all JIT-compiled code. We show in Section 4.1 how dispatch-table switching can make this use case efficient.

With multiple strategies to emulate its behavior, an after-instruction hook resides above global/local probes in the instrumentation mechanism hierarchy.
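A sketch of the first strategy, with a stand-in Engine in place of Wizard's FrameAccessor and probe-insertion API; only the br_table selection rule, where out-of-range selectors take the default target, is fixed by the Wasm specification:

struct BrTable {
    targets: Vec<usize>, // bytecode offsets of the branch targets
    default: usize,
}

// The same rule the VM applies: out-of-range selectors take the default.
fn branch_target(t: &BrTable, selector: i32) -> usize {
    *t.targets.get(selector as usize).unwrap_or(&t.default)
}

struct Engine;

impl Engine {
    // The br_table's i32 selector is already on the operand stack when
    // the "before" probe fires, so it can be read non-intrusively.
    fn top_of_stack_i32(&self) -> i32 { 2 }
    fn insert_probe_at(&mut self, offset: usize) {
        println!("probe p inserted at offset {offset}");
    }
}

// The probe q_p: it fires just before the br_table, predicts the branch,
// and plants p at the destination, so p runs "after" the br_table.
fn q_p(engine: &mut Engine, table: &BrTable) {
    let selector = engine.top_of_stack_i32();
    let target = branch_target(table, selector);
    engine.insert_probe_at(target);
}

fn main() {
    let table = BrTable { targets: vec![10, 20, 30], default: 40 };
    let mut engine = Engine;
    q_p(&mut engine, &table);
}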
3 The Monitor Zoo

The wide variety of analyses and the ease with which they are implemented11 showcase the flexibility of having a fully programmable instrumentation mechanism in a high-level language. Users activate monitors with flags when invoking Wizard (e.g. wizeng --monitors=MyMonitor), which instrument modules at various stages of processing before execution and may generate post-execution reports. Examples of monitors we have built include a variety of useful tools.

11 Most monitors required a dozen or two lines of instrumentation code; in fact, most lines are usually spent on making pretty visualizations of the data!

The Trace monitor prints each instruction as it is executed. While many VMs have tracing flags and built-in modes that may be spread throughout the code, Wizard already offers the perfect mechanism: the global probe. Instruction-level tracing in Wizard simply uses one global probe. Other than a short flag to enable it, there is nothing special about this probe; it uses the standard FrameAccessor API as it prints instructions and the operand stack.

The Coverage monitor measures code coverage. It inserts a local probe at every instruction (or basic block), which, when fired, sets a bit in an internal data structure and then removes itself. By removing itself, the probe will no longer impose overhead, either in the interpreter or in JITed code. Eventually, all executed paths in the program will be probe-free and JITed code quality will asymptotically approach zero overhead. This is a good example of a monitor using dynamic probe removal (sketched below).

The Loop monitor counts loop iterations. It inserts CountProbes at every loop header and then prints a nice report. This is a good example of a counter-heavy analysis.

The Hotness monitor counts every instruction in the program. It inserts CountProbes at every instruction and then prints a summary of hot execution paths. This is another example of a counter-heavy analysis.

The Branch monitor profiles the direction of all branches. It instruments all if, br_if and br_table instructions and uses the top-of-stack to predict the direction of each branch. It is a good example of non-trivial FrameAccessor usage.

The Memory monitor traces all memory accesses. It instruments all loads and stores and prints loaded and stored addresses and values. It is another good example of non-trivial FrameAccessor usage.

The Debugger REPL implements a simple read-eval-print loop that allows interactive debugging at the Wasm bytecode level. It supports breakpoints, watchpoints, single-step, step-over, and changing the state of value stack slots. It primarily uses local probes but uses a global probe to implement single-step functionality. This monitor is a good example of dynamic probe insertion and removal. It is also the only monitor (so far) that modifies frames.

The Calls monitor instruments callsites in the program and records statistics on direct calls and the targets of indirect calls. Its output can be used to build a dynamic call graph from an execution.

The Call tree profiler measures execution time of function calls and prints self and nested time using the full calling-context tree. It can also produce flame graphs. It inserts local probes at all direct and indirect callsites and all return locations12. It is a good example of a monitor that measures non-virtualized metrics like wall-clock time.

12 Wizard has preliminary support for the proposed Wasm exception handling mechanism, but does not yet have monitoring hooks for unwind events.
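The Coverage monitor's self-removing probe, promised above, might look like the following standalone sketch; on_reach and the two bit vectors are illustrative, since the real monitor is written against Wizard's probe API in Virgil:

// One probe per location; each sets its coverage bit and removes itself
// on first firing, so every subsequent visit runs probe-free.
struct Coverage {
    covered: Vec<bool>, // one coverage bit per instrumented location
    probes: Vec<bool>,  // whether a probe is still installed there
}

impl Coverage {
    fn new(locations: usize) -> Self {
        Coverage { covered: vec![false; locations], probes: vec![true; locations] }
    }
    // Called by the engine when execution reaches location `loc`.
    fn on_reach(&mut self, loc: usize) {
        if self.probes[loc] {
            self.covered[loc] = true;
            self.probes[loc] = false; // self-removal: no further overhead here
        }
    }
}

fn main() {
    let mut cov = Coverage::new(4);
    for loc in [0, 1, 1, 2] {
        cov.on_reach(loc); // the second visit to location 1 costs nothing
    }
    println!("covered: {:?}", cov.covered); // [true, true, true, false]
}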
Figure 2. Code generated by Wizard's baseline JIT for different types of M-code implemented with probes. The machine code sequence for generic probes is more general than that for probes that only need the top-of-stack value, which is in turn more general than that for a fully-intrinsified counter probe.
4 Optimizing probe overhead

Optimizations in Wizard's interpreter and JIT compiler reduce overhead for both global and local probes; we evaluate the effectiveness of these techniques in Section 5.3. We define overhead as the execution time spent in neither application code nor M-code, but in transitions between application code and M-code or in additional work in the runtime system and compiler.

4.1 Optimizing global probes in the interpreter

Global probes, being the most heavyweight instrumentation mechanism, are supported only in the interpreter. It is straightforward to add a check to the interpreter loop that checks for any global probes at each instruction. However, this naive approach imposes overhead on all instructions executed, even if global probes are not enabled. One option to avoid overhead when global probes are disabled is to have two different interpreter loops, one with the check and one without, and to dynamically switch between them. This comes at some VM code space cost, since it duplicates the entire interpreter loop and handlers. Another approach, described in [55], avoids the code space cost by maintaining a pointer to the dispatch table in a hardware register. When global probes are not in use, this register points to a "normal" dispatch table without instrumentation; inserting a global probe switches the register to point to an "instrumented" dispatch table where all (256) entries point to a small stub that calls the probe(s) and then dispatches to the original handler via the "normal" dispatch table. Both code duplication and dispatch-table switching are suitable for production, as they allow the VM to support global probes while imposing no overhead when disabled.

Dynamically adding and removing global probes shouldn't ruin performance either, as they might be used to implement "after-instruction" or to trace a subset of the code, such as an individual function or loop. Our design further extends [55] by supporting global probes without deoptimizing JITed code. This is done by temporarily returning to the interpreter in global probe mode. In global probe mode, a different dispatch table is used, which, in addition to calling probes for every instruction, can use special handlers for certain bytecodes. For example, the loop bytecode does not check for dynamic tier-up (which would cause a transfer to JITed code), call instructions reenter the interpreter (rather than entering the callee's JITed code, if any), and return returns only to the interpreter (rather than to the caller's JIT code). Otherwise, JIT code remains in place. Removing global probes leaves this mode, and JIT code will naturally be reentered as normal. See Section 4.6 for how we guarantee consistency after state modifications. To our knowledge, our design is the first to support switching into a heavyweight instrumentation mode and back without discarding any JITed code, preserving performance.
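A minimal sketch of dispatch-table switching, after the design from [55] described above; the two-opcode table and handler signatures are illustrative:

type Handler = fn(&mut Interp, u8);

fn op_nop(_: &mut Interp, _: u8) { println!("nop"); }
fn op_add(_: &mut Interp, _: u8) { println!("add"); }

// Every entry of the instrumented table is this stub: fire the global
// probe(s), then run the real handler via the *normal* table.
fn probe_stub(i: &mut Interp, op: u8) {
    println!("global probe fires before opcode {op}");
    NORMAL[op as usize](i, op);
}

static NORMAL: [Handler; 2] = [op_nop, op_add];
static INSTRUMENTED: [Handler; 2] = [probe_stub, probe_stub];

struct Interp {
    table: &'static [Handler; 2], // the register-held table pointer of [55]
}

impl Interp {
    fn set_global_probes(&mut self, enabled: bool) {
        // O(1): only a pointer moves; no interpreter code is duplicated.
        self.table = if enabled { &INSTRUMENTED } else { &NORMAL };
    }
    fn dispatch(&mut self, op: u8) {
        let handler = self.table[op as usize]; // fn pointers are Copy
        handler(self, op);
    }
}

fn main() {
    let mut i = Interp { table: &NORMAL };
    i.dispatch(1);              // add: zero probe overhead
    i.set_global_probes(true);
    i.dispatch(1);              // stub fires the probe, then add
    i.set_global_probes(false);
    i.dispatch(0);              // nop: back to zero overhead
}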
4.2 Optimizing local probes in the interpreter

Both Wizard's interpreter and baseline JIT support local probes. In the interpreter, local probes impose no overhead on non-probed instructions thanks to in-place bytecode modification. With bytecode overwriting, inserting a local probe at a location 𝐿 overwrites its original opcode with an otherwise-illegal probe opcode. The original unmodified opcode is saved on the side. When the interpreter reaches a probe opcode, the Wasm program state (e.g. value stack) is already up-to-date; it saves the interpreter state, looks up the local probe(s) at the current bytecode location, and simply calls that M-code callback. This is somewhat reminiscent of machine code overwriting, a technique sometimes used to implement debugging or machine code instrumentation
(Pin, gdb and DynamoRIO). However, our approach is vastly simpler and more efficient, as it doesn't require hardware traps or solving a nasty code layout issue; only a single bytecode is overwritten.

In Wizard, since the callback is compiled machine code, the overhead is a small number of machine instructions to exit the interpreter context and enter the callback context. After returning from M-code, 𝐿's original opcode is loaded (e.g. by consulting an unmodified copy of the function's code) and executed. Removing a probe is as simple as copying the original bytecode back; the interpreter will no longer trip over it. In contrast, Pin allows disabling only by removing all instrumentation from a specified region of the original code, which effectively reinstalls the original code, an all-or-nothing approach rather than control at the granularity of individual probes. Overwriting has two primary advantages over bytecode injection: the original bytecode offsets are maintained, making it trivial to report locations to M-code, and insertion/removal of probes is a cheap, constant-time operation. Consistency is trivial; the bytecode is always up-to-date with the state of inserted instrumentation.
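A sketch of bytecode overwriting with a side table, assuming a reserved PROBE opcode value; the sample bytes happen to be real Wasm opcodes, but the Code type is an illustrative stand-in:

const PROBE: u8 = 0xFF; // reserved, never produced by valid Wasm

struct Code {
    bytes: Vec<u8>,             // the (mutated) bytecode
    original: Vec<(usize, u8)>, // side table: offset -> displaced opcode
}

impl Code {
    // O(1) insertion: a single-byte overwrite; offsets never change.
    fn insert_probe(&mut self, at: usize) {
        self.original.push((at, self.bytes[at]));
        self.bytes[at] = PROBE;
    }
    // O(1) removal: the interpreter no longer trips over the location.
    fn remove_probe(&mut self, at: usize) {
        if let Some(i) = self.original.iter().position(|&(o, _)| o == at) {
            let (_, op) = self.original.swap_remove(i);
            self.bytes[at] = op;
        }
    }
    // Used by the PROBE handler: after firing the probe(s), the
    // interpreter executes the displaced original opcode.
    fn displaced_opcode(&self, at: usize) -> u8 {
        self.original.iter().find(|&&(o, _)| o == at).map(|&(_, op)| op).unwrap()
    }
}

fn main() {
    let mut code = Code { bytes: vec![0x20, 0x41, 0x6A], original: vec![] };
    code.insert_probe(1);
    assert_eq!(code.bytes[1], PROBE);
    assert_eq!(code.displaced_opcode(1), 0x41); // the displaced i32.const
    code.remove_probe(1);
    assert_eq!(code.bytes[1], 0x41);
    println!("insert/remove round-trip ok");
}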
4.3 Local probes in the JIT

In a JIT compiler, local probes can be supported by injecting calls to M-code into the compiled code at the appropriate places. Since probe logic could potentially access (and even modify) the state of the program through the FrameAccessor, a call to unknown M-code must checkpoint the program and VM-level state. For baseline code from Wizard's JIT, the overhead is a few machine instructions more than a normal call between Wasm functions13. Compilation speed is paramount to a baseline compiler, and bytecode parsing speed actually matters. Similar to its benefits for interpreter dispatch, bytecode overwriting avoids any compilation speed overhead, because the probe opcode marks instrumented instructions and additional checks aren't needed. Overall, supporting probes adds little complexity to the JIT compiler; in Wizard's JIT, it requires less than 100 lines of code.

13 Primarily because the calling convention models an explicit value stack.
4.4 JIT intrinsification of probes

While probes are a fully-programmable instrumentation mechanism that can implement unlimited analyses, there are a number of common building blocks, such as counters, switches, and samplers, that many different analyses use. For logic as simple as incrementing a counter every time a location is reached, it is highly inefficient to save the entire program state and call through a generic runtime function to execute a single increment of a variable in memory. Thus, we implemented optimizations in Wizard's JIT to intrinsify counters as well as probes that access limited frame state.

Figure 2 shows how Wizard's baseline JIT optimizes different kinds of probes. At the left, we have uninstrumented code. For the generic probe case, the JIT inserts a call to a generic runtime routine that calls the user's probe. For the next, more specialized case, the top-of-stack probe, it inserts a direct call to the probe's fire method, passing the top-of-stack value, skipping the runtime call overhead and the cost of reifying an expensive FrameAccessor object. In general, values from the frame can be passed directly from the JITed code to M-code. Lastly, for the counter probe, Wizard's JIT simply inlines an increment of a specific CountProbe object without looking it up.

Other systems allow building custom inline M-code. For example, Pin offers a type of macro-assembler that builds IR that it compiles into the instrumented program, which is very low-level, tedious, and error-prone.
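A sketch of this specialization choice, with a toy emitter standing in for Wizard's baseline JIT; the ProbeKind cases mirror Figure 2, while the emitted pseudo-assembly strings are purely illustrative:

enum ProbeKind {
    Generic,        // needs a full FrameAccessor
    OperandTop,     // only reads the top-of-stack value
    Counter(usize), // CountProbe with a known counter slot
}

fn emit_probe(kind: &ProbeKind, asm: &mut Vec<String>) {
    match kind {
        ProbeKind::Generic => {
            // Spill state and call the runtime, which reifies a
            // FrameAccessor and fires the user's probe.
            asm.push("call runtime_fire_probes".to_string());
        }
        ProbeKind::OperandTop => {
            // Pass the top-of-stack register straight to the probe's
            // fire method; no FrameAccessor is reified.
            asm.push("mov arg0, tos_reg".to_string());
            asm.push("call probe_fire_tos".to_string());
        }
        ProbeKind::Counter(slot) => {
            // Fully intrinsified: one inlined increment, no call at all.
            asm.push(format!("inc qword [counters + {}]", 8 * *slot));
        }
    }
}

fn main() {
    let mut asm = Vec::new();
    for kind in [ProbeKind::Generic, ProbeKind::OperandTop, ProbeKind::Counter(3)] {
        emit_probe(&kind, &mut asm);
    }
    println!("{}", asm.join("\n"));
}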
4.5 Monitor consistency for JITed code

We just saw how a JIT can inline M-code into the compiled code. However, M-code can change as probes are inserted and removed during execution, making compiled code that has been specialized to M-code out-of-date. This problem can be addressed by standard deoptimization techniques such as on-stack replacement back to the interpreter and invalidation of the relevant machine code. To our knowledge, no prior bytecode-based system has employed deoptimization to support dynamic instrumentation of an executing frame; prior systems offer only hot code replacement.

4.6 Strategies for multi-tier consistency

There are several different strategies for guaranteeing monitor consistency in a multi-tier engine like Wizard. We identified four plausible strategies:

1. When instrumentation is enabled, disable the JIT.
2. When instrumentation is enabled, disable only the relevant JIT optimizations.
3. Upon frame modification, recompile the function under different assumptions about frame state and perform on-stack replacement from JITed code to JITed code.
4. Upon frame modification, perform on-stack replacement from JITed code to the interpreter.

Strategy 1) is the simplest to implement for engines with interpreters, but slow. A production Wasm engine could achieve functional correctness and the key consistency guarantees at little engineering cost, leaving instrumented performance as a later product improvement. Strategy 2) eliminates interpreter dispatch cost but, ironically, is actually a lot of work in practice, since it introduces modes into the JIT compiler and optimizations must be audited for correctness. The compiler becomes littered with checks to disable optimization, and ultimately the JIT emits very pessimistic code. Strategy 3) has other implications for JIT compilation, such
as requiring support for arbitrary OSR locations14, which is also significant engineering work.

14 Most JITs that allow tier-up OSR into compiled code only do so at loop headers.

In Wizard, we chose strategy 4, which we believe to be not only the simplest, but also the most robust. Frame modifications trigger immediate deoptimization of only the modified frame15, rewriting it in place to return to the interpreter. In the dynamic tiering configuration, sending an execution frame back to the interpreter due to modification doesn't banish it there forever; if it remains hot, it can be recompiled under new assumptions16. This means frame modification support requires the interpreter; Wizard will not allow modifications in the JIT-only configuration.

15 We observe that the JIT-compiled code for a function is not invalid; it is only the state of the single frame that now differs from assumptions in the JIT code. New calls to the involved function can still legally enter the existing JIT code.

16 Pathological cases can occur where hot frames are repeatedly modified.

Inserting or removing probes in a function also triggers deoptimization of JITed code for the function and sends existing frames back to the interpreter. This is different from a frame modification, because the JIT may have specialized the code to the instrumentation present at the time of compilation; the code is actually invalid w.r.t. the instrumentation it should execute. As with frame modifications, hot functions will eventually be recompiled. It's likely that such highly dynamic instrumentation scenarios would perform better by using M-state to enable and disable their probes rather than repeatedly inserting and removing them, which confounds engine tiering heuristics.
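A sketch of strategy 4's frame-modification path, with illustrative stand-ins for Wizard's frame internals and the FrameAccessor write path:

#[derive(Debug, PartialEq)]
enum Tier { Jit, Interpreter }

struct Frame {
    tier: Tier,
    locals: Vec<i32>,
}

impl Frame {
    // The FrameAccessor's local write: apply the change, then ensure
    // execution resumes somewhere the change is guaranteed to be seen.
    fn set_local(&mut self, index: usize, value: i32) {
        self.locals[index] = value;
        if self.tier == Tier::Jit {
            self.deoptimize(); // only this frame; JIT code is not discarded
        }
    }
    fn deoptimize(&mut self) {
        // Rewrite the frame in place so that it resumes in the
        // interpreter at the current bytecode position.
        self.tier = Tier::Interpreter;
        println!("frame deoptimized: resumes in the interpreter");
    }
}

fn main() {
    let mut frame = Frame { tier: Tier::Jit, locals: vec![0; 4] };
    frame.set_local(2, 42); // a probe mutates state: immediate deopt
    assert_eq!(frame.tier, Tier::Interpreter);
    assert_eq!(frame.locals[2], 42);
}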
Figure 3. Average relative execution time for the hotness monitor (left) and branch monitor (right), when implemented with local probes and when implemented with a global probe, on the PolyBenchC suite. Points above the bars denote the number of probe fires.

5 Evaluation

We evaluate the performance of Wizard by executing Wasm code under both the interpreter and the JIT using different monitors, and we measure the total execution time of the entire program, including engine startup and program load. We chose the "hotness" and "branch" monitors (described in Section 3). The hotness monitor instruments every instruction17 with a local CountProbe, which is representative of monitors with many simple probes. The branch monitor probes branch instructions and tallies each destination by accessing the top of the operand stack. Compared to the hotness monitor, probes in the branch monitor are sparser but more complex.

These monitors were chosen because they strike a balance between being powerful enough to capture insights about the execution of a program, yet simple enough to be implemented in other systems. They are also likely to instrument a nontrivial portion of program bytecode.

Benchmark Suites. We run Wasm programs from three benchmark suites: PolyBench/C [45] with the medium dataset, Ostrich [29], and Libsodium [22], and average execution time over 5 runs.

Given instrumented execution time 𝑇𝑖 and uninstrumented execution time 𝑇𝑢, we define absolute overhead as the quantity 𝑇𝑖 − 𝑇𝑢 and relative execution time as the ratio 𝑇𝑖/𝑇𝑢. We report relative execution times for Wizard's interpreter, Wizard's JIT (with and without intrinsification), DynamoRIO, Wasabi, and bytecode rewriting in Figures 6 and 7.

5.2 Global vs local probes

Global probes can emulate the behavior of local probes, but impose a greater performance cost by introducing checks at every bytecode instruction. We compare two implementations of the branch and hotness monitors, one using a global probe and the other using local probes. Both are executed in Wizard's interpreter, since Wizard's JIT doesn't support global probes. The results can be found in Figure 3. For the hotness monitor, since the number of probe fires is the same for local and global probes, the relative overhead is similar across all programs. For the branch monitor, local probes on branch instructions have relative execution times between 1.0–2.2×, whereas it is between 7.7–16.4× for global probes.
Figure 4. Average relative execution times for the hotness (left) and branch monitors (right), with and without probe intrinsification on the PolyBenchC suite. Ratios are relative to uninstrumented JIT execution time. Points above the bars denote the number of probe fires.

… which estimates 𝑇PD + 𝑇JIT;
3. The instrumented execution time with actual probes, which gives 𝑇PD + 𝑇M + 𝑇JIT.

The results of this analysis for the branch and hotness monitors are in Figure 5. Execution time without JIT intrinsification is shown as the entire bar for each program. The cross-hatched portions of each bar represent the execution time spent in probe dispatch logic.
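Spelled out, assuming 𝑇JIT denotes uninstrumented JIT execution time, 𝑇PD probe-dispatch time, and 𝑇M time in monitor logic (the artifact's empty-probes binaries run probes with no M-code precisely so that 𝑇PD can be isolated), the decomposition reads:

\begin{align*}
T_{\text{uninstrumented}} &= T_{\mathrm{JIT}},\\
T_{\text{empty probes}} &\approx T_{\mathrm{PD}} + T_{\mathrm{JIT}},\\
T_{\text{actual probes}} &= T_{\mathrm{PD}} + T_{M} + T_{\mathrm{JIT}},
\end{align*}

so that 𝑇PD ≈ 𝑇empty probes − 𝑇uninstrumented and 𝑇M ≈ 𝑇actual probes − 𝑇empty probes.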
[Figure 5; y-axis: Execution Time (% of total runtime) per program; cross-hatched portions: probe-dispatch logic.]
[Figure 6; series: Native (x86-64, DynamoRIO), V8 (Wasabi), Wizard (Interpreter), Wizard (JIT intrins.), Wizard (JIT), Bytecode rewriting (Wizard, JIT); log scale.]
Figure 6. Relative execution times of the hotness monitor (bottom) and branch monitor (top) in Wizard, Wasabi, and
DynamoRIO across all programs on all suites, sorted by absolute execution time. Ratios are relative to uninstrumented
execution time.
Figure 7. Mean relative execution times of the hotness monitor (left) and branch monitor (right) in Wizard, Wasabi, and DynamoRIO across the three suites (polybench, libsodium, ostrich; log scale). Ratios are relative to uninstrumented execution time.

… Wasm engine that also runs JavaScript, such as V8 [5]. For this comparison, we use V8 in its default mode (two compiler tiers)18. Figure 6 includes data for Wasabi on V8. Wasabi instrumentation is vastly slower than Wizard instrumentation due to the overhead of calling JavaScript functions. On average, a hotness monitor in Wasabi increases execution time 36.8–6350.2×, compared to 7–134× for Wizard's JIT (or 2.2–7.7× with intrinsification). The branch monitor also has a drastic performance impact of 29.9–4721.5× in Wasabi, compared to 1.0–16.6× for Wizard's JIT (or 1.0–2.8× with intrinsification).
[46] David Georg Reichelt, Stefan Kühne, and Wilhelm Hasselbring. Towards solving the challenge of minimal overhead monitoring. In Companion of the 2023 ACM/SPEC International Conference on Performance Engineering, ICPE '23 Companion, page 381–388, New York, NY, USA, 2023. Association for Computing Machinery.
[47] João Rodrigues and Jorge Barreiros. Aspect-oriented WebAssembly transformation. In 2022 17th Iberian Conference on Information Systems and Technologies (CISTI), pages 1–6, 2022.
[48] Ted Romer, Geoff Voelker, Dennis Lee, Alec Wolman, Wayne Wong, Hank Levy, Brian Bershad, and Brad Chen. Instrumentation and optimization of Win32/Intel executables using Etch. In Proceedings of the USENIX Windows NT Workshop 1997, NT'97, page 1, USA, 1997. USENIX Association.
[49] Koushik Sen, Swaroop Kalasapur, Tasneem Brutch, and Simon Gibbs. Jalangi: A selective record-replay and dynamic analysis framework for JavaScript. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2013, page 488–498, New York, NY, USA, 2013. Association for Computing Machinery.
[50] Chukri Soueidi, Ali Kassem, and Yliès Falcone. BISM: Bytecode-level instrumentation for software monitoring. In Runtime Verification: 20th International Conference, RV 2020, Los Angeles, CA, USA, October 6–9, 2020, Proceedings 20, pages 323–335. Springer, 2020.
[51] A. Srivastava, A. Edwards, and H. Vo. Vulcan: Binary transformation in a distributed environment. Technical report, Microsoft Research, 2001.
[52] Amitabh Srivastava and Alan Eustace. ATOM: A system for building customized program analysis tools. New York, NY, USA, 1994. Association for Computing Machinery.
[53] Ben L. Titzer. Harmonizing classes, functions, tuples, and type parameters in Virgil III. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '13, page 85–94, New York, NY, USA, 2013. Association for Computing Machinery.
[54] Ben L. Titzer. Wizard, an advanced WebAssembly engine for research. https://github.com/titzer/wizard-engine, 2021. (Accessed 2021-07-29).
[55] Ben L. Titzer. A fast in-place interpreter for WebAssembly. Proc. ACM Program. Lang., 6(OOPSLA2), October 2022.
[56] Ben L. Titzer. Whose baseline compiler is it anyway? CGO '24, New York, NY, USA, 2024. Association for Computing Machinery.
[57] Ben L. Titzer, Daniel K. Lee, and Jens Palsberg. Avrora: Scalable sensor network simulation with precise timing. In Proceedings of the 4th International Symposium on Information Processing in Sensor Networks, IPSN '05, page 67–es. IEEE Press, 2005.
[58] Ben L. Titzer and Jens Palsberg. Nonintrusive precision instrumentation of microcontroller software. In Proceedings of the 2005 ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems, LCTES '05, page 59–68, New York, NY, USA, 2005. Association for Computing Machinery.
[59] Raja Vallée-Rai, Phong Co, Etienne Gagnon, Laurie Hendren, Patrick Lam, and Vijay Sundaresan. Soot: A Java bytecode optimization framework. In CASCON First Decade High Impact Papers, CASCON '10, page 214–224, USA, 2010. IBM Corp.
[60] Kenton Varda. WebAssembly on Cloudflare Workers. https://blog.cloudflare.com/webassembly-on-cloudflare-workers/. (Accessed 2021-07-06).
[61] Mingzhe Wang, Jie Liang, Chijin Zhou, Zhiyong Wu, Xinyi Xu, and Yu Jiang. Odin: On-demand instrumentation with on-the-fly recompilation. In Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation, PLDI 2022, page 1010–1024, New York, NY, USA, 2022. Association for Computing Machinery.
[62] Conrad Watt. Mechanising and verifying the WebAssembly specification. In Proceedings of the 7th ACM SIGPLAN International Conference on Certified Programs and Proofs, CPP 2018, page 53–65, New York, NY, USA, 2018. Association for Computing Machinery.
[63] Matthias Wenzl, Georg Merzdovnik, Johanna Ullrich, and Edgar Weippl. From hack to elaborate technique – a survey on binary rewriting. ACM Comput. Surv., 52(3), June 2019.
[64] Zhiqiang Zuo, Kai Ji, Yifei Wang, Wei Tao, Linzhang Wang, Xuandong Li, and Guoqing Harry Xu. JPortal: Precise and efficient control-flow tracing for JVM programs with Intel Processor Trace. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, PLDI 2021, page 1080–1094, New York, NY, USA, 2021. Association for Computing Machinery.

A Artifact Appendix

A.1 Abstract

This artifact description contains information on how to reproduce all results in this paper. We describe system requirements, how to set up an environment and run our scripts that produce the data and exact figures in the paper, as well as how to modify the artifact to run your own custom experiments. Our package contains all scripts, benchmarks, monitors, and engines used. We have also provided all of our results in the package so others can do direct data comparison.

A.2 Artifact Meta-Information

• Benchmarks: The following benchmarking suites are used in our experiments:
  – PolyBench/C [45] with the medium dataset, version 4.2.
  – Ostrich [29], version 1.0.0.
  – Libsodium [22]; there are three different variations of the Libsodium benchmark, as follows:
    ∗ libsodium, the base libsodium suite, version 0.7.13.
    ∗ libsodium-2021, a variation pulled from the 2021-Q1 directory at https://github.com/jedisct1/webassembly-benchmarks.
    ∗ libsodium-no-bulk-mem, a variation of the base libsodium above without bulk memory operations.
  All of these benchmark suites have been included in the suites directory of the artifact.
• Compilation: Since we provide the benchmarks compiled to Wasm, a Wasm compiler is unnecessary. However, a Rust compiler is necessary for the wasm-bytecode-instrumenter and Wasabi tools. See the next section for the required version of rustc.
• Transformations: For the bytecode rewriting experiment, we use Walrus (version 0.20.1), a Rust library for Wasm transformations, in our wasm-bytecode-rewriter to inject our Wasm instrumentation. This crate's repo is publicly available at https://github.com/rustwasm/walrus. Wasabi also does Wasm transformations to inject calls; follow the instructions in the README.md file at the root directory of the artifact for how to set up this tool.
• Binaries: We used Wizard [54] version 23a.1617 for our experimentation. Since this version did not have support for enabling/disabling certain features via command line flags, we directly manipulated flags in the source code, compiled each required Wizard configuration for our experiments, and made them available in the bin folder. The following describes each binary's configuration:
  – base/wizeng.x86-64-linux. This is the base compilation of Wizard with no flags modified.
  – local-global/wizeng.x86-64-linux. In order to use a monitor in Wizard, its code must be present in the binary. We extended the base Wizard to contain both the local and global implementations of the hotness and branch monitors in this binary.
  – fast-count/wizeng.x86-64-linux. This binary was compiled with the intrinsifyCountProbe and intrinsifyOperandProbe flags enabled in the src/engine/Tuning.v3 file.
  – empty-probes/wizeng.x86-64-linux. To reiterate, in order to use a monitor in Wizard, its code must be present in the binary. We extended the base Wizard to contain variations of the hotness and branch monitors with no M-code in their inserted probes.
  – empty-probes-fast-count/wizeng.x86-64-linux. This binary is a combination of the above fast-count and empty-probes configurations.
  We have also included the binary btime in the bin folder, which calculates various timing characteristics of a program's execution.
• Run-time environment: This project must be run on an x86_64 Linux machine. It is not necessary to have sudo access, as software can be installed/symlinked in a user's home directory. However, having sudo access would substantially simplify the installation process.
• Metrics: We report relative execution time and absolute overhead for Wizard's interpreter, Wizard's JIT (with and without intrinsification), DynamoRIO, Wasabi, and bytecode rewriting. Given instrumented execution time 𝑇𝑖 and uninstrumented execution time 𝑇𝑢, we define absolute overhead as the quantity 𝑇𝑖 − 𝑇𝑢 and relative execution time as the ratio 𝑇𝑖/𝑇𝑢.
• Output: There are two different outputs for our scripts. The experiment*.bash scripts output CSV files. The plot*.py scripts output graphs as PDF and SVG files. We have provided our own results inside the csv and figures folders for comparison.
• Experiments: Follow the instructions in the README.md file at the base directory of the artifact for how to run experiments.
• How much disk space is required (approximately)?: 10 GB.
• How much time is needed to prepare the workflow (approximately)?: We expect experiment preparation to take around 30 minutes (if all builds/installs work well).
• How much time is needed to complete the experiments (approximately)?: To run 5 iterations per experiment, you should expect the runtime to be about 7 days when running on an Ubuntu 20.04.5 machine with 19 GiB of RAM and an Intel Core i7-4790 processor running at 3.60 GHz. This is primarily due to Wasabi being significantly slower with instrumentation.
• Publicly available?: Yes, this artifact is available at the following URL: https://zenodo.org/doi/10.5281/zenodo.10795556
• Code license: Licensed under the Apache License, Version 2.0 (https://www.apache.org/licenses/LICENSE-2.0).

A.3 Description

A.3.1 How to access. This artifact can be accessed at: https://zenodo.org/doi/10.5281/zenodo.10795556

A.3.2 Software dependencies. This artifact has the following software dependencies:

• V8 [5], commit hash f200321. V8 can be downloaded from https://github.com/v8/v8. To run the experiment scripts, this binary must be available on the PATH to be called with the d8 command.
• Wasabi [33], commit hash fe12347. Wasabi can be downloaded from https://github.com/danleh/wasabi. To run the experiment scripts, this binary must be available on the PATH to be called with the wasabi command.
• DynamoRIO [18], commit hash fc4c25f. DynamoRIO can be downloaded from https://github.com/DynamoRIO/dynamorio. To run the experiment scripts, this binary must be available on the PATH to be called with the drrun command.
• Python, version 3.8.10. To run the experiment scripts, both the python3 and python (symlinked to python3) commands must be available on the PATH.
• Python's venv package, python3.8-venv for Debian/Ubuntu systems.
• wasm-bytecode-instrumenter, commit hash 3ea2003. The bytecode-instrumenter can be downloaded from the repo. To run the experiment scripts, this binary must be available on the PATH to be called with the command wasm-bytecode-instrumenter.
• rustc, version 1.71.0.

A.4 Installation and Testing

A.4.1 Installation. To install all required dependencies, follow the detailed instructions in the README.md file in the base directory of the artifact.

A.4.2 Basic Test. To verify that an environment is correctly configured to run all scripts provided in this artifact, edit the SUITES variable in the common.bash file to only contain the polybench suite, then run the following command: RUNS=2 ./experiment-all-suites.sh. We expect this initial test to run in about 1 day (as opposed to the 7 days for all experiments, as mentioned above). When running experiments for polybench, a successful run should result in the CSV directory containing subfolders with CSV output files for each suite script. Refer to the README for instructions on how to run individual experiments.

A.5 Experiment workflow

The workflow of our experiments has two phases: collecting runtime data (saved to CSV files) and generating the corresponding figures (saved to PDF and SVG files). It is important to remember to save off any data/figures by copying the csv/figures folders to alternate locations prior to running scripts; if this is not done, their contents will be overwritten. To collect the runtime data, a user can run any of the experiment*.bash scripts (experiment-all-suites.bash
to run all experiments). Logging information will be output to stdout. Before plotting data, all experiments should be successfully run (as some figures require data across multiple experiments). To generate figures, run the plot-figure*.py scripts.

A.6 Evaluation and expected results

If you run these experiments, you will find the generated CSV and figure files in their respective folders. To verify our results, a side-by-side comparison can be done with the figures in our paper.

A.7 Experiment customization

To add your own suite:

1. Compile your suite to Wasm. The binary can only contain Wasm features supported by the Wasabi and Walrus tools, which tends to be aligned with the core specification.
2. Make your new suite available in the suites folder, following the conventions shown by the other available suites.
3. Update the SUITES variable in common.bash to contain your new suite.
4. Update the suites variable in plot.py to contain your new suite.

Helpful variables in common.bash:

1. RUNS: configures the number of runs used to collect average execution times.
2. SUITES: configures the suites that will run during experimentation.

A.8 JVMTI Experiment Artifact

A.8.1 Abstract. We also did a brief experiment, discussed in Related Work (Section 6), to assess the performance overhead imposed by JVMTI's [3] handling of MethodEntry events. To keep our core evaluations separate from this experiment, we have placed this artifact discussion below. It can be found in the directory jvmti at the base of the artifact located at https://zenodo.org/doi/10.5281/zenodo.10795556.

A.8.2 Meta-Information.

• Benchmarks: We leveraged the Richards benchmark for experimenting with JVMTI. An equivalent Richards benchmark for Java and Wasm has been provided as part of this artifact.
• Compilation: To compile the CallsMonitor, we require gcc version 9.4.0 to be installed. To compile and run the Richards Java benchmark, we require Java version 1.8 to be installed.
• Binaries: We used the base Wizard binary in our experiment, version 23a.1617. This binary has been provided as part of the artifact at the location bin/base/wizeng.x86-64-linux.
• Run-time environment: We ran this experiment on an x86_64 machine running Ubuntu 20.04.1. It is not necessary to have sudo access, as software can be installed/symlinked in a user's home directory. However, having sudo access would substantially simplify the installation process.
• Metrics: Our scripts report relative execution time and absolute overhead for JVMTI and Wizard. To ignore the base startup time required by the engines, we measure instrumented and uninstrumented runs of the Richards benchmark with 0 loops (𝑇𝑏𝑖 and 𝑇𝑏𝑢 below). Given:
  – instrumented execution time 𝑇𝑖
  – instrumented base execution time 𝑇𝑏𝑖
  – uninstrumented execution time 𝑇𝑢
  – uninstrumented base execution time 𝑇𝑏𝑢
  we define absolute overhead as the quantity (𝑇𝑖 − 𝑇𝑏𝑖) − (𝑇𝑢 − 𝑇𝑏𝑢) and relative execution time as the ratio (𝑇𝑖 − 𝑇𝑏𝑖)/(𝑇𝑢 − 𝑇𝑏𝑢).
• Output: We log all of our output to stdout, which should be redirected to a file for inspection. To view the summary of each Richards benchmark iteration, grep the file for the term SUMMARY. The specific iteration of the Richards benchmark is shown in the prefix of each line; e.g., [wasm-9-SUMMARY] means that the line is part of the summary of the Wasm execution of the Richards benchmark with 9 iterations. Each iteration is summarized by outputting all execution times for instrumented and uninstrumented variants of the Java and Wasm executions, then reporting the absolute overhead and relative execution time. To view the absolute overhead, grep the file for "On average, runtime with monitor took". To view the relative execution time for each Richards benchmark iteration, grep the file for the term "Factor". We have included our own results inside the runs_richards directory for reference.
• Experiments: The artifact scripts run instrumented and uninstrumented variations of the Richards benchmark at 9, 99, 999, 9999, and 99999 loops. Each of these variations is run 10 times to collect execution time averages across all runs.
• How much disk space is required (approximately)?: About 2 MB for the jvmti directory and the Wizard engine binary.
• How much time is needed to prepare the workflow (approximately)?: It shouldn't take longer than 30 minutes, since there are few dependencies.
• How much time is needed to complete the experiments (approximately)?: The experiment takes about 3 hours when running on an Ubuntu 20.04.6 machine with 394 GB of RAM and an Intel Xeon Platinum 8168 processor running at 2.70 GHz.
• Publicly available?: Yes, this artifact is available at the following URL: https://zenodo.org/doi/10.5281/zenodo.10795556
• Code license: Licensed under the Apache License, Version 2.0 (https://www.apache.org/licenses/LICENSE-2.0).

A.8.3 Software dependencies. Running the JVMTI experiment requires the following software dependencies:

• Java, version 1.8
• gcc, version 9.4.0

A.9 Installation and Testing

A.9.1 Installation. To install all required dependencies to run the scripts, follow the detailed instructions in the jvmti/README.md file.
A.9.2 Basic Test. The jvmti/README.md file also describes how to run a basic test to verify your environment setup.

A.10 Experiment workflow

The execution workflow is straightforward and outputs all logging information to stdout. The run_richards.sh script iterates over the different loop counts to execute (configured with the BENCHMARKS variable). For each loop count, it calculates the average absolute overhead and average relative execution time over 10 runs (configured with the NUM_RUNS variable) when running the Richards benchmark for both Java, on the JVM, and Wasm, on Wizard. If there are issues during execution, errors or warnings will be output.

A.11 Evaluation and expected results

Evaluating the results can be done by grep-ing for the various types of information described under "Output" in the Meta-Information section above. The results should be similar to what our own execution found, located in jvmti/runs_richards. It is possible that the absolute overhead varies due to differences in the underlying system; however, the relative execution time should be similar.

A.12 Experiment customization

As described above, it is possible to customize the execution by manipulating the following variables:

• NUM_RUNS: the number of times each experiment is run to get an average execution time.
• BENCHMARKS: the loop counts to run on the Richards benchmark.

A.13 Methodology

Submission, reviewing and badging methodology:

• https://www.acm.org/publications/policies/artifact-review-and-badging-current
• http://cTuning.org/ae/submission-20201122.html
• http://cTuning.org/ae/reviewing-20201122.html